Abstract: Motivated by recently derived fundamental limits
on total (transmit + decoding) power for coded communication
with VLSI decoders, this paper investigates the scaling behavior
of the minimum total power needed to communicate over AWGN
channels as the target bit-error-probability tends to zero. We
focus on regular-LDPC codes and iterative message-passing
decoders. We analyze scaling behavior under two VLSI complexity
models of decoding. One model abstracts power consumed
in processing elements (“node model”), and another abstracts
power consumed in wires which connect the processing elements
(“wire model”). We prove that a coding strategy using
regular-LDPC codes with Gallager-B decoding achieves order-optimal
scaling of total power under the node model. However, we also
prove that regular-LDPC codes and iterative message-passing
decoders cannot meet existing fundamental limits on total power
under the wire model. Further, if the transmit energy-per-bit is
bounded, total power grows at a rate that is worse than uncoded
transmission. Complementing our theoretical results, we develop
detailed physical models of decoding implementations using
post-layout circuit simulations. Our theoretical and numerical
results show that approaching fundamental limits on total power
requires increasing the complexity of both the code design and
the corresponding decoding algorithm as communication distance
is increased or error-probability is lowered.

Index Terms: Low-density parity-check (LDPC) codes; Iterative
message-passing decoding; Total power channel capacity;
Energy-efficient communication; System-level power consumption;
Circuit power consumption; VLSI complexity theory.

I. Introduction 

Intuitively, the concept of Shannon capacity captures how 
much information can be communicated across a channel 
under specified resource constraints. While the problem of 
approaching Shannon capacity under solely transmit power 
constraints is well understood, modern communication often 
takes place at transmitter-receiver distances that are very 
short (e.g., on-chip communication [3], short distance wired 
communication [4], and extremely-high-frequency short-range 
wireless communication [5]). Empirically, it has been observed 
that at such short distances, the power required for processing 
a signal at the transmitter/receiver circuitry can dominate the 
power required for transmission, sometimes by orders of
magnitude [4], [6], [7]. For instance, the power consumed in the
decoding circuitry of multi-gigabit-per-second communication
systems can be hundreds of milliwatts or more (e.g., [4], [8]),
while the transmit power required is only tens of milliwatts [7].
Thus, transmit power constraints do not abstract the relevant
power consumed in many modern systems.

Shannon capacity, complemented by modern coding- 
theoretic constructions [9], has provided a framework that is 
provably good for minimizing transmit power (e.g., in power- 
constrained AWGN channels). In this work, we focus on a
capacity question that is motivated by total power: at what
maximum rate can one communicate across a channel for a given
total power, and a specified error-probability? Alternatively,
given a target communication rate and error-probability, what
is the minimum required total power? The first simplifying
perspective to this problem was adopted in [10], [11], where 
all of the processing power components at the transmitter and 
the receiver were lumped together. However, processing power 
is influenced heavily by the specific modulation choice, coding 
strategy, equalization strategy, etc. [4], [6]. Even for a fixed 
communication strategy, processing power depends strongly 
on the implementation technology (e.g., 45 nm CMOS) and 
the choice of circuit architecture. 

Using theoretical models of VLSI implementations [12], 
recent literature has explored fundamental scaling limits [6], 
[13], [14], [15] on the transmit + decoding power consumed by 
error-correcting codes. These works abstract energy consumed 
in processing nodes [6] and wires [14], [13], [15] in the 
VLSI decoders, and show that there is a fundamental tradeoff 
between transmit and decoding power. 

In this work, we examine the achievability side of the 
question (see Fig. 1): what is the total power that known code 
families and decoding algorithms can achieve? To address this 
question, we first provide asymptotic bounds (Sections IV-V) 
on required decoding power. To do so, we restrict our analysis 
to binary regular-LDPC codes and iterative message-passing 
decoding algorithms. Our code-family choice is motivated by 
both the order-optimality of regular-LDPC codes in some 
theoretical models of circuit power [6], and their practical 
utility in both short [16] and long [17] distance settings. Recent 
work of Blake and Kschischang [18] also studied the energy 
complexity of LDPC decoding circuits, and an important 
connection to this paper is highlighted in Section VII. 

Within these restrictions we provide the following insights: 

1) Wiring power, which explicitly brings out physical
constraints in a digital system [19], costs more in the
order sense than the power consumed in processing 
nodes. Thus, the commonly used metric for decoding 
complexity — number of operations — underestimates 
circuit energy costs. 

2) Shannon capacity is the maximal rate one can communicate
at with arbitrary reliability while the transmit power
is held fixed. However, when total power minimization
is the goal, keeping transmit power fixed while bit-error
probability approaches zero can lead to highly suboptimal
decoding power. For instance, we prove (Theorems
3, 4, 5) that at sufficiently low bit-error probability, it is
more total power efficient to use uncoded transmission 
than regular-LDPC codes with iterative message-passing 
decoding, if using fixed transmit power. However, if 
transmit power is allowed to diverge to infinity, we 
show that regular-LDPC codes can outperform uncoded 
transmission in this total power sense. 

3) We prove (Corollary 2) that a strategy using regular- 
LDPC codes and the Gallager-B decoding achieves 
order-optimal scaling of total power when processing 
power is dominated by nodes as opposed to wires (see 
Section IV-C). 

4) However, we also prove a lower bound (Theorem 3) that 
holds for all regular-LDPC codes with iterative message-passing
decoders for the case where processing power is
dominated by wires, and we show that a large gap exists 
between this lower bound and existing fundamental 
limits (see Section V-C). 

To obtain insights on how an engineer might choose a 
power-efficient code for a given system, we then develop 
empirical models of decoding power consumption of 1-bit 
and 2-bit message-passing algorithms for regular-LDPC codes 
(Section VI-C). These models are constructed using post-layout
circuit simulations of power consumption for check-node
and variable-node sub-circuits, and generalizing the
remaining components of power to structurally similar codes. 

Shannon-theoretic analysis yields transmit-power-centric results,
which are plotted as “waterfall” curves (with corresponding
“error-floors”) demonstrating how close the code performs
to the Shannon limit. There, the channel path-loss can usually 
be ignored because it is merely a scaling factor for the term 
to be optimized (namely the transmit power), thereby not 
affecting the optimizing code. Since we are interested in total 
power, the path-loss impacts the code choice. For simplicity 
of understanding, path-loss is translated into a more relatable 
metric — communication distance — using a simple model 
for path-loss. The resulting question is illustrated in Fig. 1(b):
at a given data-rate, what code and corresponding decoding
algorithm minimize the transmit + decoding power for a given 
transmit distance and bit-error probability? 

In Section VI-C, we present optimization results for this 
question in a 60 GHz communication setting using our models. 
This particular setting is chosen not just because of the 
short distance, but also because the results highlight another 
conceptual point we stress in this paper: 

5) Approaching total power capacity requires an increase 
in the complexity of both the code design and the 
corresponding decoding algorithm as communication 
distance is increased, or bit-error probability is lowered. 

The results presented in this paper have some limitations.
First, we only consider a limited set of coding
strategies, and while the results and models presented here
extend easily to irregular LDPC constructions, they are
not necessarily applicable to all decoders. Second, modern
transceivers [20] contain many other processing power
sinks, including analog-to-digital converters (ADCs),
digital-to-analog converters (DACs), power amplifiers, modulation,
and equalizers, and the power requirements of each of these
components can vary^1 based on the coding strategy. While
recent works have started to address fundamental limits [21]
and modeling [22] of power consumption of system blocks
from a mixed-signal circuit design perspective, tradeoffs with
code choice of these components remain relatively unexplored.
Hence, while analyzing decoding power is a start,
other system-level tradeoffs should be addressed in future
work. It is also of great interest to understand tradeoffs at a
network level (see [6]), where multiple transmitting-receiving
pairs are communicating in a shared wireless medium. In
such situations, one cannot simply increase transmit power
to reduce decoding power: the resulting interference to other
users needs to be accounted for as well.

Fig. 1: (a) The question explored in Sections II-V: How fast does total power
diverge to $\infty$ as bit-error probability $P_e \to 0$ for regular-LDPC codes and
iterative message-passing decoding algorithms? (b) The question explored in
Section VI-C: What is the most power-efficient pairing of a code and decoding
algorithm for a given distance and bit-error probability?

The remainder of the paper is organized as follows.
Section II states the assumptions and notation used in the paper.
Sections II-C to II-G introduce theoretical models of VLSI
circuits and decoding energy. Preliminary results are stated
in Section III, which are used to analyze decoding energy in
Sections IV and V, in the context of the question illustrated
in Fig. 1(a) (obtaining the scaling behavior). Section VI
discusses circuit-simulation-based numerical models of decoding
power, in the context of the question illustrated in Fig. 1(b).
Section VII concludes the paper.

^1 For example, the resolution of ADCs used at the receiver may vary with
the code choice by virtue of the fact that changing the rate of the code may
require a change in signaling constellation (when channel bandwidth and
data-rate are fixed).

II. System and VLSI models for asymptotic analysis

Throughout this paper, we rely on Bachmann-Landau notation
[23] (i.e., “big-O” notation). We first state a preliminary
definition that is needed in order to state a precise definition
of the big-O notation that we use in this paper.

Definition 1. $\mathcal{X} \subseteq \mathbb{R}$ is a right-sided set if $\forall x \in \mathcal{X}$, $\exists y \in \mathcal{X}$
such that $y > x$.

Some examples of right-sided sets include $\mathbb{R}$, $\mathbb{N}$, and intervals
of the form $[a, \infty)$, where $a$ is a constant. We now state
the Bachmann-Landau notation for non-negative real-valued
functions defined on right-sided sets^2.

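Definition 2. Let $f : \mathcal{X} \to \mathbb{R}^{\ge 0}$ and $g : \mathcal{X} \to \mathbb{R}^{\ge 0}$ be two
non-negative real-valued functions, both defined on a right-sided
set $\mathcal{X}$. We state

1) $f(x) = O(g(x))$ if $\exists x_1 \in \mathcal{X}$ and $c_1 > 0$ s.t.
$f(x) \le c_1 g(x)$, $\forall x \ge x_1$.

2) $f(x) = \Omega(g(x))$ if $\exists x_2 \in \mathcal{X}$ and $c_2 > 0$ s.t.
$f(x) \ge c_2 g(x)$, $\forall x \ge x_2$.

3) $f(x) = \Theta(g(x))$ if $\exists x_3 \in \mathcal{X}$ and $c_4 \ge c_3 > 0$ s.t.
$c_3 g(x) \le f(x) \le c_4 g(x)$, $\forall x \ge x_3$.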

We will also need a Bachmann-Landau notation for two 
variable functions [24, Section 3.5]: 

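Definition 3. Let $u : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^{\ge 0}$ and $v : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^{\ge 0}$
be two non-negative real-valued functions, both defined on the
Cartesian product of two right-sided sets $\mathcal{X}$ and $\mathcal{Y}$. We state

1) $u(x,y) = O(v(x,y))$ if $\exists M \in \mathbb{R}$ and $c_1 > 0$ s.t.
$u(x,y) \le c_1 v(x,y)$, $\forall x,y \ge M$.

2) $u(x,y) = \Omega(v(x,y))$ if $\exists M \in \mathbb{R}$ and $c_2 > 0$ s.t.
$u(x,y) \ge c_2 v(x,y)$, $\forall x,y \ge M$.

3) $u(x,y) = \Theta(v(x,y))$ if $\exists M \in \mathbb{R}$ and $c_4 \ge c_3 > 0$ s.t.
$c_3 v(x,y) \le u(x,y) \le c_4 v(x,y)$, $\forall x,y \ge M$.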

We will often apply Definitions 2 and 3 in the limit as
bit-error probability $P_e \to 0$, where the definitions can be
interpreted as applied to a function with an argument $\frac{1}{P_e}$ as
it diverges to $\infty$. All logarithm functions $\log(\cdot)$ are natural
logarithms unless otherwise stated.

A. Communication channel model 

We assume the communication between transmitter and
receiver takes place over an AWGN channel with fixed
attenuation. The transmission strategy uses BPSK modulation,
and a $(d_v, d_c)$-regular binary LDPC code of design rate
$R = 1 - \frac{d_v}{d_c}$ [25] (which is assumed to equal the code rate).

^2 Bounded intervals that are open on the right such as (−1, 0) or [0, 5) are
also right-sided sets. Definition 2 can still be applied to functions restricted
to such sets, but we will not consider such functions in this paper.


The blocklength of the code is denoted by $n$, and the number
of source bits is denoted by $k = nR$. The decoder performs a
hard-decision on the observed channel outputs before starting
the decoding process, thereby first recovering noisy codeword
bits transmitted through a Binary Symmetric Channel (BSC)
of flip probability $p_0 = Q\left(\sqrt{2E_s/\sigma_z^2}\right)$. Here, $E_s$ is the input
energy per channel symbol and $\sigma_z^2$ is the noise power. $Q(\cdot)$
is the tail probability of the standard normal distribution,
$Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-u^2/2}\,du$. The transmit power $P_T$ is assumed
to be proportional to $E_s$, modeling fixed distance and constant
attenuation wireless communication. Explicitly, we assume
$\frac{E_s}{\sigma_z^2} = \eta P_T$ for some constant $\eta > 0$. Using known bounds
on the Q-function [26], $\frac{x e^{-x^2/2}}{\sqrt{2\pi}\,(1+x^2)} < Q(x) < \frac{e^{-x^2/2}}{x\sqrt{2\pi}}$:

$$\frac{e^{-\eta P_T}}{\sqrt{4\pi\eta P_T}\left(1 + \frac{1}{2\eta P_T}\right)} < p_0 < \frac{e^{-\eta P_T}}{\sqrt{4\pi\eta P_T}} \qquad (1)$$
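
As a quick numerical sanity check of (1), the following Python sketch evaluates the BSC flip probability together with a lower and an upper bound for a few values of $\eta P_T$. It assumes $p_0 = Q(\sqrt{2\eta P_T})$, which is the form consistent with the $e^{-\eta P_T}$ terms in (1), and it uses the standard bounds $\frac{x}{1+x^2}\phi(x) < Q(x) < \frac{1}{x}\phi(x)$; the function names are illustrative and are not taken from the paper.

```python
import math


def q_function(x: float) -> float:
    """Tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def flip_probability(eta_pt: float) -> float:
    """BSC flip probability, assuming p0 = Q(sqrt(2 * eta * P_T))."""
    return q_function(math.sqrt(2.0 * eta_pt))


def upper_bound(eta_pt: float) -> float:
    """Upper bound exp(-eta*P_T) / sqrt(4*pi*eta*P_T) on p0."""
    return math.exp(-eta_pt) / math.sqrt(4.0 * math.pi * eta_pt)


def lower_bound(eta_pt: float) -> float:
    """Standard lower bound x*phi(x)/(1+x^2) on Q(x) at x = sqrt(2*eta*P_T)."""
    x = math.sqrt(2.0 * eta_pt)
    return x * math.exp(-eta_pt) / (math.sqrt(2.0 * math.pi) * (1.0 + x * x))


if __name__ == "__main__":
    # eta*P_T values are arbitrary illustrative operating points.
    for eta_pt in (1.0, 5.0, 20.0):
        print(eta_pt, lower_bound(eta_pt), flip_probability(eta_pt),
              upper_bound(eta_pt))
```

Both bounds share the factor $e^{-\eta P_T}$, so $\log\frac{1}{p_0}$ grows linearly in $\eta P_T$ up to a logarithmic correction.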

The focus of this paper is on the analysis of the “total” 
power required to communicate on the above channel, as the 
target average bit-error probability $P_e \to 0$. Our simplified
notion of total power is defined below. 

Definition 4. The total power, $P_{\mathrm{total}}$, consumed in communication
across the channel described in II-A is defined as

$$P_{\mathrm{total}} = P_T + P_{\mathrm{Dec}}, \qquad (2)$$

where $P_T$ and $P_{\mathrm{Dec}}$ are the power spent in transmission and
decoding, respectively.

The channel model helps analyze the transmit power component
in (2), but a model for decoding power is also needed.
In the next section, we provide models and assumptions for
decoding algorithms and implementations that are used in the
paper. We allow $P_T$ and $P_{\mathrm{Dec}}$ to be chosen depending on $P_e$, $\eta$,
and the coding strategy. Throughout the paper, the minimum
total power for a strategy is denoted by $P_{\mathrm{total,min}}$ and the
optimizing transmit power by $P_T^*$.

B. Decoding algorithm assumptions 

The general theoretical results of this paper (Lemma 2,
Theorem 3) hold for any iterative message-passing decoding
algorithm (and any number of decoding iterations) that satisfies
“symmetry conditions” in [25, Def. 1] (which allow us
to assume that the all-zero codeword is transmitted). Thus, each
node only operates on the messages it receives at its inputs.
We note that the sum-product algorithm [27], the min-sum
algorithm [28], Gallager’s algorithms [29], and most other
message-passing decoders satisfy these assumptions. For the
constructive results of this paper (Corollary 1, Corollary 2,
Theorem 4, Theorem 5) we focus on the two decoding
algorithms originally proposed in Gallager’s thesis [29], that
are now called “Gallager-A” and “Gallager-B” [25]. For these
results, we will use density-evolution analysis [9] to analyze
the performance^3, for which we define the term “independent
iterations” as follows:

^3 In practice, decoding is often run for a larger number of iterations because
at large blocklengths, bit-error probability may still decay as the number of
iterations increases. In that case, density-evolution does not yield the correct
bit-error probability, as it will vary based on the code construction [30].

Definition 5. An independent decoding iteration is a decoding 
iteration in which messages received at a single variable or a 
check node are mutually independent. 

We will denote the number of independent iterations^4 that an
algorithm runs as $N_{\mathrm{iter}}$. This quantity is constrained by the
girth [25] of the code, defined as the length of the shortest
cycle in the Tanner graph of the code [31], as follows: for a
code with girth $g$, the maximum value of $N_{\mathrm{iter}}$ is $\lfloor \frac{g-2}{4} \rfloor$.

C. VLSI model of decoding implementation 

Theoretical models for analyzing area and energy costs of 
VLSI circuits were introduced several decades ago in computer 
science. These include frameworks such as the Thompson [12] 
and Brent-Kung [32] models for circuit area and energy 
complexity (called the “VLSI models”), and Rent’s rule [33], 
[34]. Our model for the LDPC decoder implementation in this 
paper is an adaptation of Thompson’s model [12], and it entails 
the following assumptions: 

1) The VLSI circuit includes processing nodes which perform
computations and store data, and wires which
connect them. The circuit is placed on a square grid
of horizontal and vertical wiring tracks of finite width
$\lambda > 0$, and contact squares of area $\lambda^2$ at the overlaps of
perpendicular tracks.

2) Neighboring parallel tracks are spaced apart by width $\lambda$.

3) Wires carry information bi-directionally. Distinct wires
can only cross orthogonally at the contact squares.

4) The layout is drawn in the plane. In other words,
the model does not allow for more than two metal
layers for routing wires in the manner that modern IC
manufacturing processes do (see Section II-G1).

5) The processing nodes in the circuit have finite memory
and are situated at the contact squares of the grid. They
connect to wires routed along the grid.

6) Since wires are routed only horizontally and vertically,
any single contact has access to a maximum of 4
distinct wires. To accommodate higher-degree nodes, a
processing element requiring $x$ external connections (for
$x > 4$) can occupy a square of side-length $x\lambda$ on the
grid, with wires connecting to any side. No wires pass
over the large square.
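
As a rough illustration of assumption 6, the sketch below computes the grid area occupied by the processing nodes alone for a $(d_v, d_c)$-regular code. The function names are hypothetical, wire area is ignored entirely, and a node with at most 4 connections is taken to occupy a single contact square.

```python
def node_side_length(num_connections: int, lam: float) -> float:
    """Side length of a processing node on the grid.

    Up to 4 connections fit on one contact square of side lam; a node with
    x > 4 connections is assumed to occupy a square of side x * lam.
    """
    if num_connections <= 4:
        return lam
    return num_connections * lam


def total_node_area(n: int, dv: int, dc: int, lam: float) -> float:
    """Area of n variable nodes plus n*dv/dc check nodes (wires excluded)."""
    if (n * dv) % dc != 0:
        raise ValueError("n * dv must be divisible by dc")
    num_checks = n * dv // dc
    variable_area = n * node_side_length(dv, lam) ** 2
    check_area = num_checks * node_side_length(dc, lam) ** 2
    return variable_area + check_area


if __name__ == "__main__":
    # Illustrative (3,6)-regular code with n = 1200, areas in units of lam^2.
    print(total_node_area(n=1200, dv=3, dc=6, lam=1.0))
```

For fixed degrees this node area grows only linearly in $n$, so any faster growth in circuit area must come from the wiring.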

$\lambda$ models the minimum feature-size which is often used
to describe IC fabrication processes. We refer to this model
as Implementation Model ($\lambda$). The decoder is assumed to be
implemented in a “fully-parallel” manner [8], i.e., a processing
node never acts as more than one vertex in the Tanner
graph [31] of the code. Each variable-node and check-node of
an LDPC code is therefore represented by a distinct processing
node in the decoding circuit. As an example, Fig. 2(a) shows
the Tanner graph for a (7,4)-Hamming code and Fig. 2(b)
shows a fully-parallel layout of a decoder for the same code.
In Sections II-E, II-F we will describe two models^5 of energy
consumption for the VLSI decoder.

^4 For our constructive results, we constrain the decoder to only perform
independent iterations. Thus, the number of independent iterations is the
same as the number of iterations for those results, but it emphasizes the
requirement on the code to ensure that the girth is sufficiently large.

Fig. 2: The Tanner graph (a) of a (7,4)-Hamming code and a fully parallel
decoder (b) drawn according to Implementation Model ($\lambda$). Each vertex in
the Tanner graph corresponds to a processing node in the layout and each
edge in the Tanner graph corresponds to a wire connecting distinct nodes.

D. Time required for processing 

In order to translate the model of II-E to a power model, 
we need the time required for computation (the computation 
time is measured in seconds and is different from the number 
of algorithmic iterations). The computations are assumed to 
happen in clocked iterations, with each iteration consisting 
of two steps: passing of messages from variable to check 
nodes, and then from check to variable nodes. If the decoding 
algorithm requires the exchange of multi-bit messages, we 
assume the message bits can be passed using a single wire. 

We denote the decoding throughput (number of source bits
decoded per second) by $R_{\mathrm{data}}$. Because a batch of $k$ source
bits is processed in parallel, the time available for processing
is $T_{\mathrm{proc}} = \frac{k}{R_{\mathrm{data}}}$ seconds.


E. Processing node model of decoding power 


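Definition 6 (Node Model ($\xi_{\mathrm{node}}$)). The energy consumed
in each variable or check node during one decoding iteration
is $E_{\mathrm{node}}$. This constant can depend on $\lambda$, $d_v$, and $d_c$. The
total number of nodes at the decoder^6 is $n_{\mathrm{nodes}} = n + (n - k) = 2n - k$.
The total energy consumed in $\tau_{\mathrm{iter}}$ decoding
iterations is $E_{\mathrm{nodes}} = E_{\mathrm{node}}\, n_{\mathrm{nodes}}\, \tau_{\mathrm{iter}}$. The decoding power is

$$P_{\mathrm{nodes}} = \frac{E_{\mathrm{nodes}}}{T_{\mathrm{proc}}} = \frac{E_{\mathrm{node}}\,(2n - k)\,\tau_{\mathrm{iter}}}{k}\, R_{\mathrm{data}} = \xi_{\mathrm{node}}\, \tau_{\mathrm{iter}},$$

where $\xi_{\mathrm{node}} = E_{\mathrm{node}} \left(\frac{2}{R} - 1\right) R_{\mathrm{data}} = \frac{E_{\mathrm{node}}\,(d_v + d_c)}{d_c - d_v}\, R_{\mathrm{data}}$.

Note that $\tau_{\mathrm{iter}}$ need not be tied to $N_{\mathrm{iter}}$.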
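
The Node Model can be sketched in a few lines of Python. The code below evaluates the decoding power in two ways, once from the node count and processing time and once from the closed-form constant, so that the identity $(2n - k)/k = 2/R - 1 = (d_v + d_c)/(d_c - d_v)$ can be checked numerically. The function names and the numerical values (1 pJ per node per iteration, 1 Gb/s) are hypothetical and chosen only for illustration.

```python
def xi_node(e_node: float, dv: int, dc: int, r_data: float) -> float:
    """Per-iteration power constant E_node * (2/R - 1) * R_data."""
    rate = 1.0 - dv / dc
    return e_node * (2.0 / rate - 1.0) * r_data


def node_model_power(e_node, dv, dc, r_data, num_iterations):
    """Decoding power under the Node Model using the closed-form constant."""
    return xi_node(e_node, dv, dc, r_data) * num_iterations


def node_model_power_from_counts(e_node, n, k, r_data, num_iterations):
    """Same quantity computed from first principles: total energy / T_proc."""
    t_proc = k / r_data                      # time available per batch of k bits
    total_energy = e_node * (2 * n - k) * num_iterations
    return total_energy / t_proc


if __name__ == "__main__":
    # Hypothetical values: 1 pJ per node per iteration, 1 Gb/s, (3,6) code.
    print(node_model_power(1e-12, 3, 6, 1e9, num_iterations=10))
    print(node_model_power_from_counts(1e-12, 1200, 600, 1e9, num_iterations=10))
```

Under this model the decoding power depends on the blocklength only through the number of iterations, since the per-bit node count is fixed by the code rate.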

This model assumes that the entirety of the decoding energy 
is consumed in processing nodes, and wires require no energy. 
In essence, this model is simply counting the number of 
operations performed in the message-passing algorithm. The 
next energy model complements the node model by accounting 
for energy consumed in wiring. 


F. Message-passing wire model of decoding power

Definition 7 (Wire Model ($\xi_{\mathrm{wire}}$)). The decoding power is
$P_{\mathrm{wires}} = C_{\mathrm{unit\text{-}area}}\, A_{\mathrm{wires}}\, V_{\mathrm{supply}}^2\, f_{\mathrm{clock}}$, where $C_{\mathrm{unit\text{-}area}}$ is
the capacitance per unit-area of a wire, $V_{\mathrm{supply}}$ is the supply
voltage of the circuit, $f_{\mathrm{clock}}$ is the clock-frequency of the
circuit, and $A_{\mathrm{wires}}$ is the total area occupied by the wires
in the circuit. The parameters $C_{\mathrm{unit\text{-}area}}$ and $V_{\mathrm{supply}}$ are
technology choices that may depend on $\lambda$, $d_v$, and $d_c$. The
parameter $f_{\mathrm{clock}}$ also may depend on $\lambda$, $d_v$, $d_c$, $R_{\mathrm{data}}$, and
the decoding algorithm. For simplicity, we write
$\xi_{\mathrm{wire}} = C_{\mathrm{unit\text{-}area}}\, V_{\mathrm{supply}}^2\, f_{\mathrm{clock}}$ and $P_{\mathrm{wires}} = \xi_{\mathrm{wire}}\, A_{\mathrm{wires}}$.

^5 Both models can also be used simultaneously. However, for simplicity, we
present the results for the two models separately.

^6 In practice, many decoder implementations actually contain more than
$n - k$ check nodes in order to break up small stopping-sets in the code.
However, we do not consider such decoders in this paper.
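
A minimal sketch of the Wire Model follows. It simply evaluates the product of capacitance per unit area, wire area, squared supply voltage, and clock frequency; the function names and all numerical values are hypothetical placeholders rather than figures from any particular process.

```python
def xi_wire(c_unit_area: float, v_supply: float, f_clock: float) -> float:
    """Power per unit wire area: C_unit_area * V_supply^2 * f_clock."""
    return c_unit_area * v_supply ** 2 * f_clock


def wire_model_power(c_unit_area, v_supply, f_clock, a_wires):
    """Decoding power under the Wire Model: xi_wire * A_wires."""
    return xi_wire(c_unit_area, v_supply, f_clock) * a_wires


if __name__ == "__main__":
    # Hypothetical values in SI units: F/m^2, V, Hz, m^2.
    print(wire_model_power(c_unit_area=2e-4, v_supply=1.0,
                           f_clock=1e9, a_wires=1e-6))
```

Because the wire area enters linearly, any asymptotic growth in $A_{\mathrm{wires}}$ with blocklength carries over directly to decoding power when the other parameters are held fixed.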


Wires in a circuit consume power whenever they are
“switched,” i.e., when the message along the wire changes
its value^7. The probability of wire switching in a message-passing
decoder depends on the statistics of the number of
errors in the received word. These statistics depend on the
flip-probability of the channel, which is controlled by the transmit
power. Further, as decoding proceeds the messages also tend to
stabilize, reducing switching and hence the power consumed
in the wires. Activity-factor [19] could therefore be introduced
in the Wire Model via a multiplicative factor between 0 and
1 that depends on $P_T$, $\eta$, and the decoding algorithm, but
modeling it accurately would require a very careful analysis.


G. On modern VLSI technologies and architectures 

The VLSI model of Section II-C and the assumptions 
made about the decoding architecture may seem pessimistic 
compared to the current state-of-the-art. However, in this
section we justify our choices by explaining how many of 
the architecture and technology optimizations that are helpful 
in current practice have no impact on the conclusions derived 
by our theoretical analysis. 

1) Multiple routing layers: Modern VLSI technologies
allow for upwards of 10 metal layers for routing wires [35].
While this helps significantly in reducing routing congestion
in practice (i.e., at finite blocklengths and non-vanishing
error-probabilities), it has no impact on asymptotic bounds on total
power. As proved in [12, Pages 36-37], for a process with $L$
routing layers, the area occupied by wires is at least $\frac{A_{\mathrm{wires}}}{L^2}$,
where $A_{\mathrm{wires}}$ is the area occupied by the same circuit when
only one metal layer is used. As long as $L$ cannot grow
with the number of vertices in the graph (it would be very
unrealistic to assume it can), it has only a constant impact on
wiring area lower bounds (see Lemmas 5, 6) and no impact
(since one can always restrict routing to a single layer) on
upper bounds (see Lemma 7). It will therefore become apparent
later in the paper that multiple routing layers do not affect
any of the theoretical results we derive.

On the other hand, having multiple active layers with
fine-grained routing between layers can lead to asymptotic
reductions in wiring area for some circuits [36]. However,
as it relates to practice, this is far beyond the reach of any
commercial foundry in existence today. Methods for designing
and fabricating such circuits (which rely on emerging
nanotechnologies and emerging non-volatile memories [37]) are
only now starting to be considered in research settings.

2) Architectural optimizations: Fully parallel, one clock-cycle
per-iteration decoders are not commonly used in practice.
Instead, serialization by dividing the number of physical
nodes in the circuit by a constant factor and using
time-multiplexing to cut down on wiring is often performed [4],
[8]. This also requires a corresponding multiplication for the
clock-frequency $f_{\mathrm{clock}}$ of the circuit to maintain the same
data-rate. Recall that dynamic power consumed in wires is
proportional to $C_{\mathrm{wires}}\, V_{\mathrm{supply}}^2\, f_{\mathrm{clock}}$ [19]. While a decrease in wire
capacitance may allow the supply to be scaled down (leading
to a reduction in power) without compromising timing, it is
not possible to scale it down indefinitely, since transistors have
a nonzero subthreshold slope [38, Section 2]. In other words,
once a lower limit on supply voltage is reached, even if $C_{\mathrm{wires}}$
can be made to decay on the order of $\frac{1}{f_{\mathrm{clock}}}$, one would no longer
achieve power savings due to the corresponding increase in
$f_{\mathrm{clock}}$. Thus, the behavior of total power in the large blocklength
limit will remain unchanged. Such architectural optimizations
do, however, have a big impact in practice (e.g., at
finite-blocklengths) since changes in constants matter then.

^7 Switching consumes energy because wires act as capacitors that need to be
charged/discharged. If voltage is maintained, little additional energy is spent.

3) Leakage power: Later in the paper (Sections IV-C, V-C), 
we will compare bounds on total power under the Node and 
Wire Models. It will turn out that the two models lead to very 
different insights, and the Wire Model results appear far more 
pessimistic. Which model then is closer to reality? It turns out 
that the Node Model is actually very optimistic. It assumes 
that each node consumes only constant energy per-iteration, 
irrespective of the clock period. From a circuit perspective, 
this is equivalent to assuming that the power consumption 
inside nodes is entirely dynamic [19], as the energy per- 
iteration does not increase with the clock-period. This is far 
from the reality in modern VLSI technologies. Transistors are 
not perfect switches [39], and every check-node and variable- 
node will consume a constant amount of leakage power while 
the decoder is on, regardless of clock period and switching 
activity. It is easy to see then, that even if the transistor leakage 
is very small, the decoding power must scale as $\Omega(n)$. For
instance, even if the architecture is highly serialized, there is
still leakage in each of the $\Theta(n)$ sequential elements (e.g.,
flip-flops, latches, or RAM cells) needed to store messages. 
It will become apparent later in the paper that this simple 
analysis is enough to establish identical conclusions to the 
lower bounds of Theorems 3, 4, and 5. Thus, the asymptotics 
of total power under the Wire Model should be viewed as 
much better predictions of what would actually happen inside 
the circuit at infinite blocklengths. 

III. Preliminary results 

In this section, we provide some preliminary results that 
will be useful in Sections IV and V. These include general 
bounds on the blocklength of regular-LDPC codes and bounds 
on the minimum number of independent iterations needed for 
Gallager decoders to achieve a specific bit-error probability. 

A. Blocklength analysis of regular-LDPC codes 

Lemma 1. For a given girth $g$ of a $(d_v, d_c)$-regular LDPC
code, a lower bound on the blocklength $n$ is

$$n > \left[(d_v - 1)(d_c - 1)\right]^{\lfloor \frac{g-2}{4} \rfloor}, \qquad (3)$$

and an upper bound on the blocklength is given by

$$n < 2(d_v + d_c)\, d_v d_c\, (2 d_v d_c + 1)^{g}. \qquad (4)$$

Proof: For the lower bound, see [14, Appendix I], and
for the upper bound, see [14, Claim 2]. □


Lemma 2. For a $(d_v, d_c)$-regular binary LDPC code decoded
using any iterative message-passing decoding algorithm for
any number of iterations, the blocklength $n$ needed to achieve
bit-error probability $P_e$ is


n 


n = < 


{ dfCdJi) ] logit 


1 I log(<^c-l) \ 
-^+log(d^-l) \ 


(1 + 9Tr)r]PT 




$d_v \ge 3$,


$d_v = 2$.


Here, $\eta > 0$ is the constant attenuation in the AWGN channel
(see Section II-A).


Proof: See Appendix A. □ 

Proof Outline: We use a technique for the finite-length
analysis of LDPC codes from [40]. First, the pairwise
error-probability for any iterative message-passing decoder is lower
bounded in terms of $n$, $d_v$, $d_c$, and $\eta P_T$ using an expression for
the minimum pseudoweight (see Appendix A for definition)
of the code. Next, due to a simple relationship between
bit-error probability and pairwise error-probability for binary
linear codes over memoryless binary-input, output-symmetric
channels, the bit-error probability can be lower bounded in
terms of $n$, $d_v$, $d_c$, and $\eta P_T$. Finally, algebraic manipulations,
an application of (1), and an application of Definition 3
complete the proof. □


B. Approximation analysis of Gallager decoding algorithms

In this section, we bound the number of independent
decoding iterations required to attain a specific bit-error
probability with Gallager decoders. These bounds are used in
Sections IV, V-B to prove achievability results for total power.

Lemma 3. The number of independent decoding iterations
$N_{\mathrm{iter}}$ needed to attain bit-error probability $P_e$ with
Gallager-A decoding is

$$N_{\mathrm{iter}} = \begin{cases} \Theta\left(\log\frac{1}{P_e}\right) & \text{if } P_T \text{ is held constant,} \\ \Theta\left(\frac{\log\frac{1}{P_e}}{\eta P_T}\right) & \text{if } P_T \text{ is not held fixed.} \end{cases}$$

Here, $\eta > 0$ is the constant attenuation in the AWGN channel
(see Section II-A).

Proof: See Appendix B. □

Proof Outline: We first define (based on the decoding
threshold over the BSC [25]) appropriate right-sided sets for
analyzing the asymptotics of $N_{\mathrm{iter}}$ as a function of $\frac{1}{P_e}$ and $P_T$.
Then, we apply a first-order Taylor expansion to the recurrence
relation for bit-error probability under independent iterations
of Gallager-A decoding from [25, Eqn. (6)] and carefully
bound the approximation error. We then show that for small
enough $P_e$ or large enough $P_T$, the approximation error can
be bounded by a multiplicative factor between a constant fraction and 1. After
some algebraic manipulations and an application of (1), we
apply Definition 2 to establish the first case and Definition 3
to establish the second case. □

Lemma 4. The number of independent decoding iterations
$N_{\mathrm{iter}}$ needed to attain bit-error probability $P_e$ with a
Gallager-B decoder with variable node degree $d_v \ge 4$ is given by

$$N_{\mathrm{iter}} = \begin{cases} \Theta\left(\log\log\frac{1}{P_e}\right) & \text{if } P_T \text{ is held constant,} \\ \Theta\left(\log\frac{\log\frac{1}{P_e}}{\eta P_T}\right) & \text{if } \lim_{P_e \to 0} \frac{P_T}{\log\frac{1}{P_e}} = 0. \end{cases}$$

Here, $\eta > 0$ is the constant attenuation in the AWGN channel
(see Section II-A).

Proof: See Appendix C. Importantly, this holds only if
$d_v \ge 4$, otherwise Gallager-A and Gallager-B are equivalent.
Note that in the second case, we assume $P_T$ is a function
of $P_e$, so both expressions should be interpreted with
Definition 2. Further, little generality is lost by the necessary
condition for the second case, since uncoded transmission
requires transmit power $\Theta\left(\log\frac{1}{P_e}\right)$ (see (1)). □

Proof Outline: We follow exactly the same steps as the
proof of Lemma 3, but instead use a higher-order Taylor
expansion of the recurrence relation for bit-error probability
under Gallager-B decoding from [29, Eqn. 4.15]. □

IV. Analysis of energy consumption in the node model

In this section, we investigate the question: as $P_e \to 0$, how
does the total power under the Node Model (see Section II-E)
scale when Gallager decoders (restricted to independent
iterations) are used?


A. Total power analysis for Gallager-A decoding 


Corollary 1. The optimal total power under Gallager-A
decoding (restricted to independent iterations) in the Node
Model ($\xi_{\text{node}}$) for a binary $(d_v, d_c)$-regular LDPC code is

$$P_{\text{total,min}} = \Theta\left(\sqrt{\log\frac{1}{P_e}}\right),$$

which is achieved by transmit power $P_T^* = \Theta\left(\sqrt{\log\frac{1}{P_e}}\right)$.

Proof: Applying Lemma 3 to the Node Model, if $P_T$
is held constant even as $P_e \to 0$, the power consumed by
decoding is $\Theta\left(\log\frac{1}{P_e}\right)$. Since $P_T$ is constant, the total power
is also $P_{\text{total,bdd}\,P_T} = \Theta\left(\log\frac{1}{P_e}\right)$. If instead $P_T$ is allowed
to grow arbitrarily, the total power is given by

$$P_{\text{total}} = P_T + P_{\text{Dec}} = \Theta\left(P_T + \frac{\log\frac{1}{P_e}}{\eta P_T}\right). \tag{5}$$

Thus, optimizing the scaling behavior of the total power over
transmit power functions $P_T$,

$$P_{\text{total,min}} = \min_{P_T} \Theta\left(P_T + \frac{\log\frac{1}{P_e}}{\eta P_T}\right) = \Theta\left(\sqrt{\log\frac{1}{P_e}}\right), \tag{6}$$

with optimizing transmit power $P_T^* = \Theta\left(\sqrt{\log\frac{1}{P_e}}\right)$. □
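
As a rough numerical illustration of the optimization behind (5) and (6), the following Python sketch minimizes the expression $P_T + \log(1/P_e)/(\eta P_T)$ in closed form. It assumes that all constants hidden inside the $\Theta(\cdot)$ notation equal 1, which is an assumption made only for illustration; the function names are hypothetical and not taken from the paper.

```python
import math


def total_power_gallager_a(p_t, log_inv_pe, eta=1.0):
    """Node-model proxy for total power: P_T + log(1/P_e) / (eta * P_T).

    All Theta(.) constants are set to 1 (illustrative assumption).
    """
    return p_t + log_inv_pe / (eta * p_t)


def optimal_transmit_power(log_inv_pe, eta=1.0):
    """Minimizer of the proxy: the derivative 1 - L / (eta * P_T^2) vanishes here."""
    return math.sqrt(log_inv_pe / eta)


if __name__ == "__main__":
    for pe in (1e-6, 1e-12, 1e-24):
        log_inv_pe = math.log(1.0 / pe)
        p_star = optimal_transmit_power(log_inv_pe)
        # Both the optimizing transmit power and the resulting total
        # grow like sqrt(log(1/P_e)) under this proxy.
        print(pe, p_star, total_power_gallager_a(p_star, log_inv_pe))
```

Under this proxy the minimum total equals twice the optimizing transmit power, so the transmit and decoding terms are balanced at the optimum.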

B. Total power analysis for Gallager-B decoding 
Corollary 2. The optimal total power under Gallager-B
decoding (restricted to independent iterations) in the Node
Model ($\xi_{\text{node}}$) for a binary $(d_v, d_c)$-regular LDPC code is

$$P_{\text{total,min}} = \Theta\left(\log\log\frac{1}{P_e}\right),$$

which is achieved by transmit power $P_T^* = \Theta(1)$.

Proof: If $P_T$ satisfies the condition stated in the second
case of Lemma 4, the total power in the Node Model is

$$P_{\text{total}} = P_T + P_{\text{Dec}} = \Theta\left(P_T + \frac{\log\frac{\log\frac{1}{P_e}}{\eta P_T}}{\log\frac{d_v - 1}{2}}\right). \tag{7}$$

Minimizing the scaling behavior of (7), the optimizing transmit
power is $P_T^* = \Theta(1)$. The optimal total power is then

$$P_{\text{total,min}} = \Theta\left(\log\log\frac{1}{P_e}\right). \tag{8}$$

In this case the optimizing transmit power is bounded even as
$P_e \to 0$. □

C. Comparison with fundamental limits

Can we reduce the asymptotic growth of total power under
the Node Model via a better code or a more sophisticated
decoding algorithm? After all, we limited our attention to
regular LDPCs and simple one-bit message-passing algorithms.
It was shown in [6] that under the Node Model and a
fully-parallelized decoding implementation such as Implementation
Model ($\lambda$), the optimal total power is lower bounded
by $\Omega\left(\log\log\frac{1}{P_e}\right)$, matching Corollary 2. In fact, using a
code which performs close to Shannon capacity can even
reduce efficiency for this strategy: if a capacity-approaching
LDPC code is used instead of a regular LDPC code, the
infinite-blocklength performance under the Gallager-B decoding
algorithm equals that of regular-LDPCs with Gallager-A
decoding. In other words, the bit-error probability decays
only exponentially (and not doubly-exponentially) with the
number of iterations under Gallager-B decoding if degree-2
variable nodes are present [41], and [42] shows that degree-2
variable nodes are required in order to achieve capacity (the
fraction of degree-2 variable nodes required to attain capacity
is characterized in [42]). Thus, rather than searching for an
irregular code that approaches capacity, an engineer might
be better off using a simpler regular code that approaches
fundamental limits on total power.

V. Analysis of energy consumption in the wire model

A. Bounds on wiring area of decoders

To make use of the energy model of Section II-F, we must
characterize the total wiring area of the decoder. We rely on
techniques for upper and lower bounds on the total wire area
obtained for different computations in [12], [43], [44], [45].
We first introduce some graph-theoretic concepts that will
prove useful in obtaining similar bounds for our problem.

1) Lower bound on wiring area: We first provide a trivial
lower bound on the wiring area of the decoder for any
regular-LDPC code implemented in Implementation Model ($\lambda$).

Lemma 5. For a $(d_v, d_c)$-regular LDPC code of blocklength
$n$, the wiring area $A_{\text{wires}}$ under Implementation Model ($\lambda$) is

$$A_{\text{wires}} \ge \lambda^2 d_v n.$$

Proof: There are $d_v n$ wires. Each wire has width $\lambda$ and
minimum length $\lambda$ (no two wires overlap completely). □
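
The counting argument in this proof can be written out directly. The short Python sketch below computes the number of check nodes and edges in the Tanner graph of a $(d_v, d_c)$-regular code and the resulting trivial wiring-area bound. The function names and the default value of `lam` are hypothetical choices for illustration.

```python
def tanner_graph_counts(n, d_v, d_c):
    """Return (number of check nodes, number of edges) for a (d_v, d_c)-regular code."""
    if (n * d_v) % d_c != 0:
        raise ValueError("n * d_v must be divisible by d_c for a regular code")
    return n * d_v // d_c, n * d_v


def trivial_wiring_area_bound(n, d_v, lam=1.0):
    """Lower bound on wiring area: n * d_v wires, each of width lam and length at least lam."""
    return lam ** 2 * d_v * n


if __name__ == "__main__":
    checks, edges = tanner_graph_counts(12, 3, 6)
    print(checks, edges)                          # 6 check nodes, 36 edges
    print(trivial_wiring_area_bound(12, 3, 2.0))  # 144.0
```

The bound grows only linearly in the blocklength, which is why the crossing-number argument that follows is needed to tighten it.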

In his thesis [45], Leighton utilizes the crossing number (a
property first defined by Turán [46]) of a graph as a tool for
obtaining lower bounds on the wiring area of circuits. Crossing
numbers continue to be of interest to combinatorialists and
graph-theorists, and many difficult problems on finding exact
crossing numbers or bounds for various families of graphs
remain open [47]. We use the following two definitions to
introduce this property.

Definition 8 (Graph Drawing). A drawing of a graph $\mathcal{G}$ is a
representation of $\mathcal{G}$ in the plane such that each vertex of $\mathcal{G}$ is
represented by a distinct point and each edge is represented by
a distinct continuous arc connecting the corresponding points,
which does not cross itself. No edge passes through vertices
other than its endpoints and no two edges are overlapping for
any nonzero length (they can only intersect at points).

Definition 9 (Crossing Number). The crossing number of a
graph $\mathcal{G}$, $\operatorname{cr}(\mathcal{G})$, is the minimum number of edge-crossings
over all possible drawings of $\mathcal{G}$. An edge-crossing is any point
in the plane other than a vertex of $\mathcal{G}$ where a pair of edges
intersects.
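
To make the two definitions concrete, the following Python sketch counts edge-crossings in one particular drawing, where every edge is a straight line segment (a special case of the arcs allowed in a drawing). The count for any single drawing is only an upper bound on the crossing number, since the crossing number minimizes over all drawings. The helper names are hypothetical, and the code assumes vertices in general position (no three collinear).

```python
from itertools import combinations


def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): +1 left turn, -1 right turn, 0 collinear."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def count_crossings(pos, edges):
    """Count pairs of straight-line edges that cross at an interior point."""
    crossings = 0
    for (a, b), (c, d) in combinations(edges, 2):
        if len({a, b, c, d}) < 4:
            continue  # edges sharing an endpoint meet at a vertex, not a crossing
        # Two segments cross properly when each one separates the
        # endpoints of the other.
        if (_orient(pos[a], pos[b], pos[c]) * _orient(pos[a], pos[b], pos[d]) < 0
                and _orient(pos[c], pos[d], pos[a]) * _orient(pos[c], pos[d], pos[b]) < 0):
            crossings += 1
    return crossings


if __name__ == "__main__":
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    square = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}
    nested = {0: (0, 0), 1: (4, 0), 2: (2, 4), 3: (2, 1)}
    print(count_crossings(square, k4))  # 1: the two diagonals cross
    print(count_crossings(nested, k4))  # 0: a planar drawing of the same graph
```

The two drawings of the complete graph on four vertices show why the minimum over drawings matters: the same graph has one crossing in the first drawing and none in the second.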

For any graph $\mathcal{G}$ (e.g., the Tanner graph of an LDPC code),
the wiring area of the corresponding circuit under Implementation
Model ($\lambda$) is lower bounded as $A_{\text{wires}} \ge \lambda^2 \operatorname{cr}(\mathcal{G})$. This
is due to the fact that any VLSI layout of the type described
in Section II-C can be mapped to a drawing of $\mathcal{G}$ in the
sense of Definition 8, by simply replacing each processing
node with a point in the plane and replacing each wire by
line segments connecting two points. Therefore, the minimum
number of wire crossings of any layout of $\mathcal{G}$ is $\operatorname{cr}(\mathcal{G})$. Since
every crossing has area $\lambda^2$, the inequality follows. We now
need lower bounds on the crossing number of a computation
graph. In this paper, we make use of the following result [48]
that improves on earlier results [49], [50], [45] and allows us
to tighten Lemma 5 for some codes.

Theorem 1 (Pach, Spencer, Tóth [48]). Let $\mathcal{G} = (V, E)$ be a
graph with girth $g > 2\ell$ and $|E| \ge 4|V|$. Then $\operatorname{cr}(\mathcal{G})$ satisfies

$$\operatorname{cr}(\mathcal{G}) \ge k_\ell \frac{|E|^{\ell+2}}{|V|^{\ell+1}},$$

where $k_\ell > 0$ is a constant that depends only on $\ell$ [51].

We now obtain lower bounds on wiring area given a lower 










bound on the number of independent iterations the code 
allows. 

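Lemma 6 (Crossing Number Lower Bound on $A_{\text{wires}}$). For
a $(d_v, d_c)$-regular LDPC code that allows for at least $N_{\text{iter}}$
independent decoding iterations, the wiring area $A_{\text{wires}}$ of a
decoder in Implementation Model ($\lambda$) is

$A_{\text{wires}} = \Omega\left(e^{\gamma N_{\text{iter}}}\right)$ for any $d_v, d_c$,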

/ 2d^ dfi \ 

n f 6 "^“" j if dydc > 4:{dv + dc). 

Here, $\gamma \in [\log((d_v - 1)(d_c - 1)),\ 3\log(2 d_v d_c + 1)]$ is a
constant that depends on the code construction.

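Proof: Let $\mathcal{C}$ be a $(d_v, d_c)$-regular LDPC code that allows
for at least $N_{\text{iter}}$ independent decoding iterations. Since the
girth $g$ of $\mathcal{C}$ must then satisfy $\left\lfloor \frac{g-2}{4} \right\rfloor \ge N_{\text{iter}}$, we have $g > 4N_{\text{iter}} - 2$.
From Lemma 1 then, the blocklength $n$ of the code $\mathcal{C}$ is

$$n = \Omega\left(e^{\gamma N_{\text{iter}}}\right),$$

where $\gamma \in [\log((d_v - 1)(d_c - 1)),\ 3\log(2 d_v d_c + 1)]$. And
from Lemma 5 we then have $A_{\text{wires}} = \Omega\left(e^{\gamma N_{\text{iter}}}\right)$. Now,
assume $d_v d_c \ge 4(d_v + d_c)$. This requires that $d_c > d_v \ge 5$.
Let $V_{\mathcal{C}}$, $E_{\mathcal{C}}$ denote the sets of vertices and edges in the Tanner
graph of $\mathcal{C}$. The sizes are $|E_{\mathcal{C}}| = n d_v$ and $|V_{\mathcal{C}}| = n\left(1 + \frac{d_v}{d_c}\right)$.

We then carry out the following algebra:

$$d_v d_c \ge 4(d_v + d_c) \implies n d_v \ge 4n\left(1 + \frac{d_v}{d_c}\right).$$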


decoding iterations, the decoder wiring area $A_{\text{wires}}$ is

$$A_{\text{wires}} = O\left(e^{2\gamma N_{\text{iter}}}\right).$$

Here, $\gamma \in [\log((d_v - 1)(d_c - 1)),\ 3\log(2 d_v d_c + 1)]$ is a
constant that depends on the code construction.

Proof: Let $\mathcal{C}$ be a $(d_v, d_c)$-regular LDPC code that allows
for no more than $N_{\text{iter}}$ independent decoding iterations. Since
the girth $g$ of $\mathcal{C}$ must then satisfy $\left\lfloor \frac{g-2}{4} \right\rfloor \le N_{\text{iter}}$,

$$g < 4N_{\text{iter}} + 6.$$

From Lemma 1, the blocklength of any such code can be upper
bounded in the order of $N_{\text{iter}}$ as

$$n = O\left(e^{\gamma N_{\text{iter}}}\right), \tag{10}$$

where $\gamma \in [\log((d_v - 1)(d_c - 1)),\ 3\log(2 d_v d_c + 1)]$. Then,
consider a “collinear” VLSI layout [52] of the Tanner graph of
$\mathcal{C}$ which satisfies all the assumptions described in Section II-C.
Arrange all variable-nodes and check-nodes in the graph
along a horizontal line, leaving $\lambda$ spacing between consecutive
nodes. The total length of this arrangement is then $O(n)$.
Allocate a unique horizontal wiring track for each of the $n d_v$
edges in the Tanner graph. Then, every connection in the graph
can be made with two vertical wires (one from each endpoint)
which connect to the opposite ends of the dedicated horizontal
track. The total height of this layout is then $O(n)$, and the total
area is $O(n^2)$. An example collinear layout is given in Fig. 3.
Substituting (10) for $n$, we obtain the bound. □
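
The collinear construction in this proof can be sketched numerically. The Python function below computes the bounding-box area of such a layout under simplifying assumptions that are not from the paper: nodes are treated as points spaced `lam` apart, each edge track has height `lam`, and the extent of the vertical wires is absorbed into the track height. The function name is hypothetical.

```python
def collinear_layout_area(n, d_v, d_c, lam=1.0):
    """Bounding-box area of a collinear layout for a (d_v, d_c)-regular Tanner graph.

    Simplified model: width is one lam per node, height is one lam per edge track.
    """
    num_checks = n * d_v // d_c      # check nodes in the Tanner graph
    num_nodes = n + num_checks       # all nodes sit on one horizontal line
    num_edges = n * d_v              # one dedicated horizontal track per edge
    width = lam * num_nodes
    height = lam * num_edges
    return width * height


if __name__ == "__main__":
    small = collinear_layout_area(12, 3, 6)
    large = collinear_layout_area(24, 3, 6)
    # Doubling the blocklength quadruples the area, i.e. area = Theta(n^2).
    print(small, large, large / small)
```

Because both the width and the height are linear in $n$, the area is quadratic in $n$, which matches the order of growth used in the proof.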


Hence, $|E_{\mathcal{C}}| \ge 4|V_{\mathcal{C}}|$. Using the fact that $g > 4N_{\text{iter}} - 2$, we
apply Theorem 1


A ■ — O 

-2^ Wires — 


= n 


A 2 


{ndy) 






2 -/V;, 




df) dr 


( 2 iVit 3 ,-l)" \dy+dc 


2N,, 


(9) 


Then, because $e^{\gamma} \ge (d_v - 1)(d_c - 1) = d_v d_c - (d_v + d_c) + 1 \ge 3(d_v + d_c) + 1$,
and because $d_c > d_v \ge 5$, we
must have $e^{\gamma} \ge 34$. Substituting into (9),


A ■ — O 

-^wires — ^ 




(2iVi,g,-l)^ \dy+dc 


dydc 


= n a 22 ^*‘' 


dydc 
dy “t“ dr 


2Ny 


and changes-of-base complete the proof. 


□ 
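
The degree condition used in this proof is easy to check mechanically. The following Python sketch verifies, for given degrees, whether $d_v d_c \ge 4(d_v + d_c)$ holds, and confirms on an example that this is the same as the edge-density condition $|E| \ge 4|V|$ on the Tanner graph. The function names are hypothetical.

```python
from fractions import Fraction


def tanner_sizes(n, d_v, d_c):
    """Return (|V|, |E|): n variable nodes plus n*d_v/d_c check nodes, and n*d_v edges."""
    return n + Fraction(n * d_v, d_c), n * d_v


def satisfies_edge_density(d_v, d_c):
    """True when d_v*d_c >= 4*(d_v + d_c), equivalently |E| >= 4|V| for every n."""
    return d_v * d_c >= 4 * (d_v + d_c)


if __name__ == "__main__":
    for d_v, d_c in [(4, 100), (5, 19), (5, 20), (6, 12)]:
        print((d_v, d_c), satisfies_edge_density(d_v, d_c))
    num_vertices, num_edges = tanner_sizes(20, 5, 20)
    print(num_vertices, num_edges, num_edges >= 4 * num_vertices)
```

With $d_v = 4$ the condition fails for every $d_c$, since $4 d_c < 16 + 4 d_c$, which is consistent with the requirement $d_v \ge 5$ stated in the proof.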


2) Upper bound on wiring area: Since the total circuit area
is always an upper bound on the area occupied by wires, we
use an upper bound on the circuit area to obtain the following
upper bound on the wiring area based on the maximum number
of independent iterations that the code allows for.


Lemma 7 (Upper bound on $A_{\text{wires}}$). For a $(d_v, d_c)$-regular
LDPC code that allows for no more than $N_{\text{iter}}$ independent



Fig. 3: An example collinear layout for the same (7,4) Hamming Code 
depicted in Fig. 2. 


We note that this upper bound is crude since the
$O(n^2)$ layout construction applies for any graph $\mathcal{G} = (V, E)$
which satisfies $|E| = O(|V|)$. A simple
proof [45] shows that one can create a layout of area
$O\left((|V| + \operatorname{cr}(\mathcal{G})) \log^2(|V| + \operatorname{cr}(\mathcal{G}))\right)$ for any graph. Thus, an
algorithm for drawing semi-regular graphs which can be
proven to yield sub-quadratic (in $n$) crossing numbers would
yield energy-efficient codes and decoders with short wires.

B. Total power minimization for the wire model 

We now present analogues of results in Section IV, where
we instead consider decoding power described by the Wire
Model of Section II-F. We translate the wiring area bounds of
Section V-A to power bounds.

Theorem 2 (Asymptotic bounds on $P_{\text{wires}}$). Under Implementation
Model ($\lambda$) and Wire Model ($\xi_{\text{wire}}$), the decoding power
$P_{\text{wires}}$ for a $(d_v, d_c)$-regular binary LDPC code that allows
for exactly $N_{\text{iter}}$ independent iterations is bounded as

{ O far any dy, dc 

fl ^ if dydc > 4((i„ + dy) 

O far any dy, dy 


where $\gamma \in [\log((d_v - 1)(d_c - 1)),\ 3\log(2 d_v d_c + 1)]$ is a
constant that depends on the code construction.


Proof: The result is a straightforward conclusion from 
Lemma 6 and Lemma 7 applied in Definition 7. □ 

Next, we present a general lower bound on the scaling 
behavior of total power under the Wire Model for any binary 
regular-LDPC code, decoded using any iterative message¬ 
passing decoding algorithm, for any number of iterations. 


Theorem 3 (Lower bound for regular-LDPCs). The optimal
total power in the Wire Model ($\xi_{\text{wire}}$) for a binary $(d_v, d_c)$-regular
LDPC code with any iterative message-passing decoding
algorithm to achieve bit-error probability $P_e$ is


.^total 


= n 


log(dc-l) 

log log(<i„ -1) 



Further, if Pt is held fixed as Py ^ Q the total power diverges 
as Cl ^log^ where y > 1, which dominates the power 
required by uncoded transmission. 


Proof: See Appendix D. □ 

Proof Outline: We first substitute the result of Lemma 2 
into Lemma 5, and then use the resulting lower bound on 
decoding power under the Wire Model in (2). Using simple 
calculus, we then derive the asymptotics of the transmit power 
function that minimizes the total power, and plug it back into 
Lemma 2 and (2) to obtain the result. □ 

1) Gallager-A decoding: 


Theorem 4. The optimal total power under Gallager-A decoding
(restricted to independent iterations) in the Wire Model
($\xi_{\text{wire}}$) for a binary $(d_v, d_c)$-regular LDPC code to achieve
bit-error probability $P_e$ is


Pt 


total,min 


= 0 




\log log 


where $\eta > 0$ is the constant attenuation in
the AWGN channel (Section II-A) and
$\gamma \in [\log((d_v - 1)(d_c - 1)),\ 3\log(2 d_v d_c + 1)]$ is a constant
that depends on the code construction. Further, if $P_T$ is held
fixed as $P_e \to 0$, then total power diverges as $\Omega\left(\operatorname{poly}\left(\frac{1}{P_e}\right)\right)$,
which is an exponential function of the power required by
uncoded transmission.


Proof: See Appendix E. □ 

Proof Outline: We first substitute the results of Lemma 3 
into Theorem 2, and then plug in the resulting bounds on 


decoding power in (2). Using some calculus, we then derive 
the best-case and worst-case asymptotics of the transmit power 
function that minimizes the total power, and show that there 
is at most a constant gap between the two. We then plug 
the optimizing transmit power back into Lemma 3 and (2) 
to obtain the result. □ 

2) Gallager-B decoding: 


Theorem 5. The optimal total power under Gallager-B decoding
(restricted to independent iterations) in the Wire Model
($\xi_{\text{wire}}$) for a binary $(d_v, d_c)$-regular LDPC code to achieve
bit-error probability $P_e$ is bounded as


P 


total,min ^ 


f! (^log. - 

/ 31 1 

_ 1 U ( log« — 


I I logCdu -l)-log 2 

I log 6 log(2di, dc + 1) 


dydc 

{dy-\-dc) 


<4 


d^dg 


> 4 


any dy, dy 


Further, if Pt is held fixed as Py —>■ 0, then total power di¬ 
verges as Cl ^log^'"^® ^)’ ^ super-quadratic function 

of the power required by uncoded transmission. 


Proof: See Appendix F. □ 

Proof Outline: We first substitute the results of Lemma 4 
into Theorem 2, and then plug in the resulting bounds on 
decoding power in (2). We then use algebraic manipulations 
to bound the exponents in Theorem 2. Next, we use calculus to 
derive the best-case and worst-case asymptotics of the transmit 
power function that minimizes the total power. We then plug 
the optimizing transmit power into Lemma 4 and (2) to obtain 
the results. □ 

C. Comparison with fundamental limits 

In [13], using a more pessimistic Wire Model*, it is shown 
that the total power required for any error-correcting code 
and any message-passing decoding algorithm is fundamentally 
lower bounded by U ^logs -p^j, where is the block- 
error probability. Theorem 3 shows that regular-LDPC codes 
with iterative message-passing decoders cannot do better than 
Cl |^log= where Py is bit-error probability, and the ex¬ 
ponent i can only be obtained in the limit of large degrees 
and vanishing code-rate. Since block-error probability exceeds 
bit-error probability, regular-LDPC codes do not achieve fun¬ 
damental limits® on total-power in the Wire Model. 

Theorem 4 is the first constructive result that shows that
coding can (asymptotically) outperform uncoded transmission
in total power for the Wire Model. However, the gap in total
power between the two is merely a multiplicative factor of
$\log\log\frac{1}{P_e}$. While Theorem 5 proves that it is possible to
increase the relative advantage of coding to a fractional power
of $\log\frac{1}{P_e}$, the difference between the upper bound and the
power for uncoded transmission is minuscule. The exponent
of $\log\frac{1}{P_e}$ in the upper bound is an increasing function of both
$d_v$ and $d_c$, approaching 1 as either gets large. Since Gallager-B
decoding requires $d_v \ge 4$, the smallest exponent for regular
LDPCs occurs when $d_v = 4$ and $d_c = 5$. The numerical value
of the exponent for these degrees is ≈ 0.98, which suggests
little order-sense improvement over uncoded transmission.
Hence, the wiring area at the decoder (particularly, how much
better it can be than the bound of Lemma 7) is crucial in
determining how much can be gained by using Gallager-B
decoding instead of uncoded transmission. Further discussion
is provided in Section VII.

Footnotes to Section V-C: The Wire Model of [13] assumes the power is
proportional to $A_{\text{wires}} N_{\text{iter}}$. Here it is assumed to be simply
proportional to $A_{\text{wires}}$. The gap to the fundamental limits may
simply mean that the fundamental limits of [13] are not tight.

VI. Circuit simulation based numerical results 

At reasonable bit-error probabilities (e.g., 10“®) and short 
distances (e.g., less than five meters), asymptotic bounds
cannot provide precise answers on which codes to use. For
example, consider the following problem, shown graphically
in Fig. 1b).

Problem 1. Suppose we want to design a point-to-point
communication system that operates over a given channel.
We are given a target bit-error probability $P_e$, communication
distance $r$, and system data-rate $R_{\text{data}}$ that the link must
operate at. Which code and corresponding decoding algorithm
minimize the total (i.e., transmit + decoding) power?

Since the bounds of Sections II-V are derived as $P_e \to 0$,
they may not be applicable to many instances of Problem 1.
In this section we therefore develop a methodology for rapidly
exploring a space of codes and decoding algorithms to answer
specific instances of Problem 1. We focus on one-bit Gallager
A and B [29] and two-bit [53] decoding algorithms, restricting
the number of algorithmic iterations to the number of
independent iterations that the code girth allows. Because of
the effort required in implementing or even simulating a
single decoder in hardware, we construct models for power
consumed in decoding implementations of different algorithms
based on post-layout circuit simulations for simple check-node
and variable-node circuits. (These models have been created in
an open-access CMOS library [54] and are online at [55].) The
models developed attempt to capture detailed physical aspects
(e.g., interconnect lengths and impedance parameters,
propagation delays, silicon area, and power-performance
tradeoffs) of implementations, in stark
contrast with their theoretical counterparts of Sections II-V. In
Section VI-C, we use these models to investigate solutions to
some instances of Problem 1.

A. Note on channels and constellation size

To answer Problem 1, additional physical assumptions about
the channel (e.g., bandwidth, fading, path-loss, temperature,
constellation size) are required in comparison to the model of
Section II-A. The channel is still assumed to be AWGN with
fixed attenuation. However, while Section II-A assumes BPSK
modulation for all transmissions, due to the introduction of a
data-rate constraint and fixed passband bandwidth $W$ (for fair
comparison), the constellation size is required to vary based on
the code rate. Explicitly, the transmission strategy is assumed
to use either BPSK or square-QAM modulation, mapping
codeword bits to constellation symbols. We assume that if
square-QAM modulation is used, the information bits are
mapped onto the constellation signals using a two-dimensional
Gray code as explained in [56, Section III]. We assume the
transmitter signals at a rate of $W$ symbols/s and that the
minimum square constellation size ($M$) satisfying the system
data-rate requirement is chosen; $M$ is always the smallest
square of an even integer for which:

$$M \ge 2^{R_{\text{data}}/(W \times R)},$$

where $R$ denotes the code rate.
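
The constellation-size rule can be sketched in a few lines of Python. The sketch assumes the condition reads $M \ge 2^{R_{\text{data}}/(W \times R)}$ with $R$ the code rate, which is our reading of the damaged formula, and it covers only the square-QAM case (the BPSK option is not modeled). The function and parameter names are hypothetical.

```python
def min_square_qam_size(r_data, bandwidth, code_rate):
    """Smallest M = k*k with k even such that M >= 2 ** (r_data / (bandwidth * code_rate))."""
    threshold = 2.0 ** (r_data / (bandwidth * code_rate))
    k = 2
    while k * k < threshold:
        k += 2  # only squares of even integers are allowed
    return k * k


if __name__ == "__main__":
    # 7 Gb/s over 7 GHz of bandwidth, for a few illustrative code rates.
    for rate in (0.5, 0.25, 0.2):
        print(rate, min_square_qam_size(7e9, 7e9, rate))
```

Lower code rates push the required constellation size up quickly, because the exponent grows as the inverse of the rate.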


For calculating transmit power numbers, the thermal noise
variance used is $\sigma^2 = kTW$, where $k$ is the Boltzmann
constant ($1.38 \times 10^{-23}$ J/K), and $T$ is the temperature. The
power is assumed to decay according to a power-law path-loss
model $1/r^{\alpha}$, where $\alpha$ is the path-loss coefficient. The
received $\frac{E_b}{N_0}$ is obtained as a function of the system and
channel parameters:


Eb _ _Hr_ 

“ fcTIL(^)“log2(M)’ 


( 11 ) 


where $\lambda$ is the wavelength of transmission at center frequency
$f_c$ in Hz ($\lambda = 3 \times 10^8 / f_c$). The channel flip probability for
BPSK transmissions under this model is $p_0 = Q\left(\sqrt{2 E_b / N_0}\right)$,
and the channel flip probability for $M$-ary square QAM is [56,
Section III.B]:


1 


Po = 


\og2{VM) 
X I 2''-! - 


E E 

fc=l 7=0 

j X 2'=-! 1 

L s/M ^2 


(—1)L vm J 


(2J + 1)1 


/3#^log2(M) 
(M-1) 


( 12 ) 


Also, note that the asymptotic bounds derived in Sections II- 
V remain unchanged, even if we substitute M-ary QAM for 
BPSK as the signaling constellation. This follows from the 
fact that the RHS of equation (12) is a linear combination of 
Q(-) functions with argument linearly proportional to ^ 

( -(pP'jp \ 

-j===\ for some 

constant <p ^ p (see (1)). Since the difference is merely a 
constant, the asymptotic analysis of Sections II-V holds. 

For the results presented in Section VI-C, we assume the
decoding throughput is required to be equal to $R_{\text{data}} = 7$
Gb/s. We assume a channel center frequency of $f_c = 60$ GHz
and bandwidth of $W = 7$ GHz. The temperature $T$ is 300 K.
The distances considered are much larger than the wavelength
of transmission (≈ 0.5 cm) so the “far-field approximation”
applies.
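
The channel parameters quoted above can be turned into numbers directly. The Python sketch below computes the wavelength at the 60 GHz center frequency and the thermal noise power $kTW$ at 300 K over 7 GHz, using the constants stated in the text. The function names are hypothetical.

```python
import math

BOLTZMANN = 1.38e-23   # J/K, value quoted in the text
SPEED_OF_LIGHT = 3e8   # m/s, value used in the text


def wavelength_m(f_c_hz):
    """Wavelength in meters at center frequency f_c."""
    return SPEED_OF_LIGHT / f_c_hz


def thermal_noise_watts(temp_k, bandwidth_hz):
    """Thermal noise variance sigma^2 = k * T * W in watts."""
    return BOLTZMANN * temp_k * bandwidth_hz


def watts_to_dbm(p_watts):
    """Convert a power in watts to dBm."""
    return 10.0 * math.log10(p_watts / 1e-3)


if __name__ == "__main__":
    print(wavelength_m(60e9))                 # about 0.005 m, i.e. 0.5 cm
    noise = thermal_noise_watts(300.0, 7e9)
    print(noise, watts_to_dbm(noise))         # about 2.9e-11 W, roughly -75.4 dBm
```

The 0.5 cm wavelength agrees with the value stated in the text, so distances of a meter or more are well inside the far-field regime.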


B. Simulation-based models of LDPC decoders 

Given a code, decoding algorithm, and desired data-rate,
calculating the required decoding power is a difficult task.
Even within the family of regular LDPC codes and specified
decoding algorithms, the decoder can be implemented in myriad
ways. The choice of circuit architecture, implementation
technology, and even process-specific transistor options can
have a significant impact on the decoding power [8], [4]. A
comprehensive solution to Problem 1 requires optimization
of total power over not just super-exponentially many codes
and decoding algorithms, but also all decoder architectures,
implementation technologies, and process options, which could
be an impossibly hard problem. The models we present
here are based on simulations of synchronous, fully-parallel
decoding architectures in a 32/28nm CMOS process with a
high threshold voltage, and are used in Section VI-C to obtain
insights on the nature of optimal solutions. We believe that
incorporating more models of this nature and performing the
resulting optimization could be a good approach to obtain
low total power solutions. We now describe how the model
is generated.

1) Initial post-layout simulations: Our models for arbitrary- 
blocklength LDPC decoders are constructed based on circuit 
simulations using the Synopsys 32/28nm high threshold volt¬ 
age CMOS process with 9 metal-layers [54]. First, post-layout 
simulations of check-node and variable-node circuits for one- 
bit and two-bit decoders are performed. The physical area, 
power consumption, and critical-path delays of the check- 
nodes and variable-nodes are used as the basis for our models. 
The CAD flow used is detailed in Appendix G. The next sec¬ 
tion details how these results are generalized to full decoders. 

2) Physical model of LDPC decoding: Even within our 
imposed restrictions on the LDPC code degrees, girth, and 
number of message-passing bits for decoding, constructing a 
decoding power model that applies to all combinations of these 
code parameters requires some assumptions: 

1. Decoders operate at a fixed supply voltage (chosen
as 0.78V: the minimum supply voltage of the timing
libraries included with the standard-cell library).

2. The code design space includes regular-LDPC codes
with variable-node degrees $2 \le d_v \le 6$, check-node
degrees $3 \le d_c \le 13$, and girths $6 \le g \le 10$.

3. “Minimum-Blocklength” codes (found in [57]) are chosen
for a given $g, d_v, d_c$. Hence the blocklength is
expressed as a function of these parameters: $n^{(\min)}_{g, d_v, d_c}$.

4. The decoding algorithm $a$ is chosen from the set
{A, B, T}, where A, B, T correspond to Gallager-A,
Gallager-B, and Two-bit [53] message-passing decoding
algorithms, respectively. We use #bits($a$) to refer
to the number of message bits used in algorithm $a$.

We then model the minimum-achievable clock period $T_{\text{CLK}}$
and maximum-achievable decoding throughput $R_{\text{Dec}}$ for each
decoder as functions of $a, g, d_v, d_c$:

$$T_{\text{CLK}}(a, g, d_v, d_c) = T_{\text{VN}}(a, d_v) + 2\,T_{\text{wire}}(a, g, d_v, d_c) + T_{\text{CN}}(a, d_c) \tag{13}$$

$$R_{\text{Dec}}(a, g, d_v, d_c) = \frac{n^{(\min)}_{g, d_v, d_c}\left(1 - \frac{d_v}{d_c}\right)}{\left\lfloor \frac{g-2}{4} \right\rfloor \times T_{\text{CLK}}(a, g, d_v, d_c)} \tag{14}$$

In (13), $T_{\text{VN}}(\cdot,\cdot)$ and $T_{\text{CN}}(\cdot,\cdot)$ are critical-path delays
through variable and check nodes respectively and
$T_{\text{wire}}(\cdot,\cdot,\cdot,\cdot)$ is the propagation delay through a single
message-passing interconnect. In essence, (13) formulates
the critical-path delay for the decoder by summing up the
propagation delays of all logic stages traversed in a single
decoding iteration. Details for each component are given in
Appendix H. We model the decoding power as

Footnote on the Two-bit algorithm: the fixed decoding algorithm parameters
are chosen as C = 2, S = 2, W = 1, for reasons explained in [53, Section II].


Puec {a,g,dy,dc) = dc 


PvN (a, dy) -I- 


dyPctt (a, dc) 


T ^idy^lyits (u) X Pwire (H; Pi dy, dc) 


15) 


In (15), $P_{\text{VN}}(\cdot,\cdot)$ and $P_{\text{CN}}(\cdot,\cdot)$ are the power consumed
in individual variable and check nodes respectively, and
$P_{\text{wire}}(\cdot,\cdot,\cdot,\cdot)$ is the power consumed in a single
message-passing interconnect. Note that (15) is a sum of all power
consumed in computations and wires of the decoder (the
coefficients in (15) count the number of occurrences of each
power sink in the decoder). The details of the node power
models are given in Appendix I and the details of the wire
power model are given in Appendix J.

3) Satisfying the communication data-rate: Fixing the supply
voltage for a decoder and using the fastest possible clock
speed only allows for a single decoding throughput. Hence,
parallelism in order to meet the system data-rate requirement
$R_{\text{data}}$ in Problem 1 is also modeled. For example, two copies
of a single decoder can be used in parallel. Together, they
provide twice the throughput, and require twice the power
of a single decoder. In the corresponding communication
system architecture, two separate codewords are required to
be transmitted at twice the throughput of a single decoder, and
a multiplexer at the receiver must pass a separate codeword to
each of the parallel decoders, which decode the two codewords
independently. Though making such a design choice in practice
would introduce additional hardware and a slight power
consumption overhead, we ignore this cost in our analysis.

In cases where integer multiples of a single decoder’s
throughput do not exactly reach $R_{\text{data}}$, we first find the
minimum number of parallel decoders that, when combined,
exceed the required throughput. Calling this minimum number
of decoders $Q$, we then assume that the clock period of each of
the parallel decoders is increased until the overall throughput
of the parallel combination is exactly $R_{\text{data}}$. Explicitly, the
formula to determine this “underclocked” period $T_u$ is:


Tu 


e X (i - 1 ) 

L^J X i?data 


(16) 


Because the decoding power is modeled as inversely proportional
to the decoder clock period (see Appendices I-J), we
multiply each individual decoder’s power by the appropriate
scaling factor $\kappa = \frac{T_{\text{CLK}}(a, g, d_v, d_c)}{T_u}$ and multiply the result
by the number of parallel decoders to get the total power of
the parallel combination:

$$P_{\text{parallel}} = Q \times P_{\text{Dec}}(a, g, d_v, d_c) \times \kappa. \tag{17}$$

We substitute (14), (16), and carry out some algebra to obtain:

$$P_{\text{parallel}} = P_{\text{Dec}}(a, g, d_v, d_c) \times \frac{R_{\text{data}}}{R_{\text{Dec}}(a, g, d_v, d_c)}. \tag{18}$$

Hence, we assume that any (throughput, power) pair that is a
multiple of the specifications of the original decoder can be
achieved in this manner (with the obvious exception of points
that have negative throughput and power). Therefore, in our
analysis in Section VI-C, we assume the decoding throughput
is exactly $R_{\text{data}}$ and we use the decoding power numbers
obtained via this interpolation between the modeled points.
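
The parallelization and underclocking steps described above can be sketched as follows. The Python function takes a single decoder's throughput and power, picks the minimum number of parallel copies, and scales the power by the clock-slowdown factor, assuming (as the text states) that power is inversely proportional to the clock period. The function name, argument names, and the example numbers are hypothetical.

```python
import math


def parallel_decoder_plan(r_data, r_dec, p_dec):
    """Return (Q, kappa, P_parallel) for meeting throughput r_data exactly.

    r_dec and p_dec are the throughput and power of one decoder at its
    fastest clock; kappa = T_CLK / T_u is the underclocking factor.
    """
    q = math.ceil(r_data / r_dec)    # minimum number of parallel decoders
    kappa = r_data / (q * r_dec)     # slow each copy so the sum is exactly r_data
    p_parallel = q * p_dec * kappa   # power scales inversely with clock period
    return q, kappa, p_parallel


if __name__ == "__main__":
    # Illustrative numbers only: a 2 Gb/s decoder drawing 10 mW, 7 Gb/s target.
    q, kappa, p_parallel = parallel_decoder_plan(7.0, 2.0, 10.0)
    print(q, kappa, p_parallel)
    # The result equals p_dec * r_data / r_dec, independent of Q.
    print(10.0 * 7.0 / 2.0)
```

The number of copies cancels out of the final power, which is why the text treats any multiple of a decoder's (throughput, power) pair as achievable.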

4) Comparing different coding strategies: Now, given a
subset of codes and decoders, how should a system designer
jointly choose a code and decoding algorithm to minimize
the total system power? Within the channel model of
Section VI-A, consider specific instances of Problem 1: let
path-loss coefficient $\alpha$ and $R_{\text{data}}$ be fixed. Then, for each choice
of ($r$, $P_e$), we can compare the required total power for
each combination of code and decoding algorithm modeled
in Section VI-B2, and find the minimizing combination.
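
The selection step for one ($r$, $P_e$) point amounts to a minimum over candidate strategies. The Python sketch below shows this with made-up power values; the labels and numbers are hypothetical placeholders and are not results from the paper.

```python
def total_power(transmit_power, decoding_power):
    """Total power is the sum of transmit and decoding power."""
    return transmit_power + decoding_power


def best_strategy(candidates):
    """Pick the (label, transmit_power, decoding_power) tuple with least total power."""
    return min(candidates, key=lambda c: total_power(c[1], c[2]))


if __name__ == "__main__":
    # Hypothetical values in watts for a single (r, P_e) point.
    candidates = [
        ("uncoded", 0.50, 0.00),
        ("code-1 / Gallager-A", 0.20, 0.10),
        ("code-2 / Gallager-B", 0.15, 0.25),
    ]
    label, p_t, p_dec = best_strategy(candidates)
    print(label, total_power(p_t, p_dec))
```

Repeating this minimum over a grid of ($r$, $P_e$) points yields the kind of region map described in the text.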

C. Example: 60 GHz point-to-point communication 

An example plot which shows the minimum achievable total
power for different $P_e$ values at a fixed distance $r = 3.2$ m and
$\alpha = 3$ is given in Fig. 4. The plot also shows the curve of the
optimizing transmit power, $P_T^*$, and the Shannon-limit [58] for
the AWGN channel. The horizontal gap between the optimizing
$P_T$ curve and the total power curve in Fig. 4 corresponds
to the optimizing decoding power. As $P_e$ decreases, this gap
increases, indicating an increase in the total power-minimizing
decoder’s complexity.

Fig. 4: A plot of $\log_{10}(P_e)$ vs. minimum achievable total power for $\alpha = 3$
at a fixed distance of $r = 3.2$ m. The Shannon limit for the channel and the
optimizing transmit power are also shown.

1) Joint optimization over code-decoder pairs: The form
of the total power curve varies with communication distance.
For improved understanding, we use two-dimensional contour
plots in the ($r$, $P_e$) space to evaluate choices of codes and
decoders, as suggested by Fig. 1b). An example is shown
in Fig. 5, which compares code and decoding algorithm
choices for path-loss coefficient $\alpha = 3$. In the top plot,
the contours represent regions in the ($r$, $P_e$) space where
specific combinations minimize total power, and in the bottom
plot, regions in the ($r$, $P_e$) space are divided based on the
value of the minimum total power. The best choices for these
instances of Problem 1 turn out to be rate ^ codes. Lower rate 
codes require large constellations for a 7 Gb/s data-rate, thus
requiring large transmit power for the same $p_0$, and higher
rate codes require larger decoding power due to increased
complexity and size of higher degree nodes. Some tradeoffs
between total power and code and decoder complexity can
also be observed in Fig. 5; to minimize total power, algorithm
complexity $a$ should increase with $r$ and code girth $g$ should
increase with decreasing $P_e$.




Fig. 5: Contour plots of the optimizing code & decoding algorithm choice
(top) and the minimum total power in dBm (bottom), plotted against
distance (m). For these plots, $\alpha = 3$.
The contours in the top plots are labeled with blocklength $n$, code girth $g$,
VN degree $d_v$, CN degree $d_c$, and decoding algorithm $a$ of the optimizing
code and decoder. To interpret the plots, one can choose any point in the ($r$,
$P_e$) space and find the best coding strategy (within the search space) in the
top plot and the required total power to implement it in the bottom plot. The
plot is best viewed in color.


How does the inclusion of uncoded transmission as a 
possible strategy change the picture? Contour plots with 
uncoded transmission included are given in Fig. 6. Comparing 
Fig. 6 with Fig. 5, we see that when uncoded transmission is 
included, it overtakes areas in the (r, Pe) space where Pe is 
high and r is very small. However, Fig. 6 suggests that simple 
codes and decoders can still outperform uncoded transmission 
at reasonably low Pe and distances of several meters or more. 


VII. Conclusions and discussions 

In this work, we performed asymptotic analysis of the
total (transmit + decoding) power for regular-LDPC codes
with iterative message-passing decoders. While these codes
(with Gallager-B decoding) can achieve fundamental limits
in the Node Model [6], they are unable to do so for the
Wire Model [13]. This suggests that measuring complexity
of decoding by simply counting the number of operations

[Fig. 6 plots; horizontal axis: Distance (m)]


Fig. 6: Contour plots of the optimizing code & decoding algorithm choice
(top) and the minimum total power in dBm (bottom), including uncoded
transmission in the optimization space. For these plots, $\alpha = 3$. The contours
in the top plots are labeled with blocklength $n$, code girth $g$, VN degree $d_v$,
CN degree $d_c$, and decoding algorithm $a$ of the optimizing code and decoder.
To interpret the plots, one can choose any point in the $(r, P_e)$ space and
find the best coding strategy (within the search space) in the top plot and
the required total power to implement it in the bottom plot. The plot is best
viewed in color.


(e.g., [59], [60], [61]) is insufficient for understanding system-level
power consumption. In fact, for the Wire Model, even
achieving an order-sense advantage over uncoded transmission
requires that both transmit and decoding power diverge to $\infty$
as $P_e \to 0$, which calls into question the assumption that one
should fix the transmit power and operate near the Shannon
capacity in order to communicate reliably at a low power cost.
However, this analysis also established a result of intellectual
interest: that regular-LDPC codes can achieve an order-sense
improvement in total power over uncoded transmission as
bit-error probability tends to zero. This question only arises
from the total power perspective adopted in this work, and it
suggests that these results are only scratching the surface of a
deeper theory in this direction.

To establish some constructive results, we analyzed two
strategies where the number of decoding iterations is dictated
by the girth of the code. Although this is convenient for
proving asymptotic upper bounds on total power, this choice is rarely
followed in practice. Typically, combinatorial properties of
the code construction are analyzed and simulations are performed
[30] in order to discern the error-probability behavior
in decoding iterations beyond the girth-limit. However, we are
not sure if better asymptotics for total power can be achieved
by merely adding these additional iterations.


Our work highlights an important question that has received
little attention in coding theory literature: the design of codes
that have good performance while maintaining small wiring
area (see [62], [63], [64] for some heuristic approaches to
generating Tanner graphs with low wiring complexity). For
wire power consumption, there is a significant gap between
the bounds on power consumed by regular-LDPC codes and
iterative message-passing decoders derived here, and the fundamental
limits derived in [13]. Nevertheless, even though
regular-LDPC codes might not achieve these fundamental
limits (and the fundamental limits themselves may not be
tight), it is important to investigate wiring complexity of
other coding families, such as Polar codes [61] and Turbo
codes [65].

Recent work of Blake and Kschischang [18] studied the
limiting bisection-width [12] of sequences of bipartite graphs
with the size of the left-partite set tending to infinity, when
the limiting degree-distributions of the left and right partite
sets satisfy a certain sufficient condition [18, Theorem 1]. It
is shown that when sequences satisfying this condition are
generated by a standard uniform random configuration model
(see [18, Section IV] for definition), the resulting graphs have
a super-linear (in the number of vertices) bisection-width in
the limit of the sequence with probability 1. In Corollary 2 and
Section IV-A of [18], the authors show that the Tanner graphs
of all capacity-approaching LDPC sequences as well as some
regular-LDPC sequences generated using this method will
satisfy the sufficient conditions. A super-linear bisection width
for a graph implies that the area of wires in the corresponding
VLSI circuit must scale at least quadratically in the number
of vertices [12]. If using the decoding strategy of Theorem 5,
then such sequences of codes will have minimum total power
that is $\Omega\left(\log^{m} \frac{1}{P_e}\right)$ where $0.97 < m < 1$, providing little
order-sense improvement over uncoded transmission.

The authors of [18] point out that their result does
not rule out the possibility that there may exist a zero-measure
(asymptotically in $n$) subset of codes that has sub-quadratic
wiring area. One could try to extend the bisection-width
approach of [18] to establish a negative result (i.e., prove
that no such zero-measure set exists). To establish a positive
result, one could try the open problem mentioned at the end
of Section V-A2, namely, construct a graph-drawing algorithm
that yields sub-quadratic crossing numbers for (even some
classes of) semi-regular graphs. In any case, a proof is needed,
and heuristics such as those used in [64] (even if they work
well in practice) cannot establish guarantees.

The simulation-based estimates of decoding power presented
in Section VI confirm that coding can be useful for
minimizing total power, even at short distances. For instance,
they predict that regular-LDPC codes with simple message-passing
decoders can achieve lower bit-error probabilities than
uncoded transmission in short distance settings, while still
consuming the same total power (even at distances as low
as 2 meters). However, in these regimes, it is possible that
“classical” algebraic codes (e.g., Hamming or Reed-Solomon
codes [67]) might be even more efficient; hence, they need to
be examined as well.

Footnote: The crossing number $\mathrm{cr}(\mathcal{G})$ and bisection width $\mathrm{bw}(\mathcal{G})$ of a bounded-degree
graph $\mathcal{G} = \{V, E\}$ are related by the inequality $\mathrm{cr}(\mathcal{G}) + \Theta(|V|) = \Omega\left(\mathrm{bw}^2(\mathcal{G})\right)$
[66, Theorem 2.1].

Finally, the results of Section VI point to a new problem,
that of “energy-adaptive codes”. The suggestion from these
results is that the code should be adapted to changing error-probabilities
and distances. Can a single code, with a single
piece of reconfigurable decoding hardware, enable adaptation
of transmit and circuit power to minimize total energy? Indeed,
some follow-up work [68] indicates this is possible, and it
could be a promising direction for future work.

Appendix A
Proof of Lemma 2

Proof of Lemma 2: First, note that the $\Omega(\cdot)$ expression
in Lemma 2 contains two variables: $\frac{1}{P_e}$ and $P_T$. We analyze
blocklength as a function $n$ of these two variables on the domain
$[2, \infty) \times \mathbb{R}^{\geq 0}$ (Definition 3). First consider the case where $d_c > d_v \geq 3$.
Because the codes considered are binary and linear, the
channel is memoryless, binary-input, and output-symmetric,
and the decoding computations are symmetric with respect
to codewords, we can assume without loss of generality that
the all-zero codeword was transmitted [25], [40, Page 22].
In [40, Page 37] it is shown that for any binary regular-LDPC
code with $d_c > d_v \geq 3$ used to transmit over an
AWGN channel, the probability that any iterative message-passing
decoder incorrectly decides on pseudo-codeword $\omega$
when the all-zero codeword was transmitted is

$$\geq Q\left(\sqrt{2 \frac{E_s}{N_0} w_p^{\mathrm{AWGN}}(\omega)}\right), \tag{19}$$

where $w_p^{\mathrm{AWGN}}(\omega)$ is called the AWGN pseudoweight of $\omega$
and is defined [40, Definition 31] as

$$w_p^{\mathrm{AWGN}}(\omega) = \begin{cases} \dfrac{\|\omega\|_1^2}{\|\omega\|_2^2} & \text{if } \omega \neq 0, \\ 0 & \text{if } \omega = 0. \end{cases}$$

Footnote: A pseudo-codeword is defined in [40] as an error-pattern for a given code $\mathcal{C}$, such that the
lifting of the error-pattern is a codeword of the binary code corresponding to
some finite graph-cover of the Tanner graph of $\mathcal{C}$. It is explained in [40] that
no “locally operating” iterative message-passing decoding algorithm (“locally
operating” subsumes all algorithms satisfying the assumptions of Section II-B)
can distinguish between codewords and pseudo-codewords.
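The pseudoweight definition and the bound in (19) can be illustrated numerically. The following is a minimal Python sketch, assuming the standard form $\|\omega\|_1^2 / \|\omega\|_2^2$ of the AWGN pseudoweight; the function names are hypothetical and are not taken from the paper or from [40].

```python
import math


def awgn_pseudoweight(omega):
    """AWGN pseudoweight ||omega||_1^2 / ||omega||_2^2 (0 for the all-zero vector)."""
    l1 = sum(abs(x) for x in omega)
    l2_squared = sum(x * x for x in omega)
    if l2_squared == 0:
        return 0.0
    return (l1 * l1) / l2_squared


def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def pairwise_error_lower_bound(es_over_n0, omega):
    """Right-hand side of (19): Q(sqrt(2 * Es/N0 * w_p(omega)))."""
    return q_function(math.sqrt(2.0 * es_over_n0 * awgn_pseudoweight(omega)))
```

For a binary vector the pseudoweight reduces to the Hamming weight, so `awgn_pseudoweight([1, 1, 0, 1])` evaluates to 3.0, which is a quick sanity check on the definition.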


While the channel model of Section II-A assumed a hard-decision
on the AWGN-channel outputs, (19) holds even when
log-likelihood ratios (which have no loss in optimality) are
used in message-passing [40]. Hence, we can use (19) to obtain
a lower bound for any message-passing decoder. The minimum
pseudoweight $w_p^{\mathrm{AWGN,min}}(H)$ of the parity-check matrix $H$ of a given
code is defined as the minimum AWGN pseudoweight over
all nonzero pseudo-codewords of the code [40, Definition 37].
For any regular-LDPC code of blocklength $n$ with $d_v \geq 3$,
the minimum AWGN pseudoweight is upper-bounded as [40,
Proposition 49], [69, Theorem 7]:

$$w_p^{\mathrm{AWGN,min}}(H) \leq \left(\frac{d_v(d_v-1)}{d_v-2}\right)^{2} n^{\frac{2\log(d_v-1)}{\log(d_v-1)(d_c-1)}}. \tag{20}$$
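To give a feel for how slowly the bound in (20) grows with blocklength, the sketch below evaluates its right-hand side. It is a minimal illustration, assuming the form of (20) with prefactor $\left(\frac{d_v(d_v-1)}{d_v-2}\right)^2$ and exponent $\frac{2\log(d_v-1)}{\log(d_v-1)(d_c-1)}$; the function names are hypothetical.

```python
import math


def pseudoweight_exponent(dv, dc):
    """Exponent of n in (20): 2*log(dv-1) / log((dv-1)*(dc-1))."""
    return 2.0 * math.log(dv - 1) / math.log((dv - 1) * (dc - 1))


def min_pseudoweight_upper_bound(n, dv, dc):
    """Right-hand side of (20) for a (dv, dc)-regular code with dc > dv >= 3."""
    if not (dc > dv >= 3):
        raise ValueError("the bound is stated for dc > dv >= 3")
    prefactor = (dv * (dv - 1) / (dv - 2)) ** 2
    return prefactor * n ** pseudoweight_exponent(dv, dc)
```

For a (3, 6)-regular code the exponent is $2\log 2 / \log 10 \approx 0.602$, so the bound grows sublinearly in $n$; this sublinear growth is what drives the blocklength lower bound derived in the rest of the proof.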


Therefore, lower bounding the word-error probability $P_e^{\mathrm{word}}$
by the pairwise error-probability and using (20) in (19):

$$P_e^{\mathrm{word}} \geq Q\left(\sqrt{2 \frac{E_s}{N_0} \left(\frac{d_v(d_v-1)}{d_v-2}\right)^{2} n^{\frac{2\log(d_v-1)}{\log(d_v-1)(d_c-1)}}}\right). \tag{21}$$

Using our notation from Section II-A, $\frac{E_s}{N_0} = \eta P_T$, and trivially
bounding bit-error probability $P_e$ [70, Eqn. (2)]:

TDWord 

Py > - 


n 


> 


> 


(*) 

> 


y 2r]PT ^ -iK-ic-i) 


n 

1 -r/PT( ^^ytlog(d^,-lJ(de-l) 


2 1og(d,;-l) 


log(dc-l) 
Jl log(d^;-l) 




{dy-2) 


_ d^-2 

T^Pt dv{d^ — l) 


. 2 1og(d^,-l) 


,1+GgS^^r^ (2V^ + 


-( 22 ) 


where (•) holds because of ( 1 ) and n > 1 , and (*) holds 
because n > 1 and dy{dy — 1) > {dy — 2). It follows that 
r > 


whenever Pt > ' 


?7’ 


Pe > 


^ 2 log{dy-l) 


21og(d^-l) 

, , -,,PT(l+97r)('^#Lp) n^«si.dy-indo-i) 

> — --— (23) 


where (*) holds because e “ < 2 for all x > 0. Inverting
both sides of (23), taking $\log(\cdot)$ and then simplifying,

log;^ 


+ 


Pt>- 

^ — V 
< 


yPril + 97r) 
_ 2 _ 

I log(dc-l) 

n ioE(<iu-i) + log n 


log(dc-l) 

^ log(d^-l) 


(24) 


We have shown that (24) holds for any Pg and any Pt > 
4, hence ignoring the non-dominating term on the RHS and 

then raising both sides to the power ^°^ 2 'iog(d 

the desired result. For the case when $d_v = 2$, because the
minimum distance of regular-LDPC codes with $d_v = 2$ is at
most $2 + \frac{2\log\frac{n}{2}}{\log(d_c-1)}$ (see [29, Theorem 2.5]), the pairwise error-probability
that a minimum-weight nonzero codeword $x'$ is
decoded when the all-zero codeword was transmitted is


Pf 


0 —_ 



2^ (2 + ^121^' 

Wo V log(4-l). 


Replacing ^ by rjPT, the word-error probability is 


P >word \ D 

e ^ ^O^x' = 



2rjPT 2 -b 


2 log : 


log{dg - 1 )^ 


(25) 


• (26) 


Then applying an identical analysis to the > 3 case, for 
any bit-error probability Pg-. 


P. > 



Then for any Pt > 4, we also have 


P. > 




3 n ^ ttijPt (2 


> 


2 log t 

log(dc-l) ^ 

(27) 

where (*) holds because e~^ < - for all x > 0. Inverting 
both sides of (27), taking log(-) and then simplifying. 


log 7 ^ 


77Pr(l + Ott) 


log 


<2 
PT>k 


21 ogf 


logn 


log(dc - 1 ) r 7 Pr(l + 97 r) 


rjPril + 97r) 


< 2 + 


2 log n — 2 log 2 


?T>1 
< 2 


< 


log(dc - 1 ) 
2 log n — 2 log 2 
log(4 - 1 ) 

2 


f logn 
2 logn 


log(dc - 1 ) 


( 1 -blogn). (28) 


Dividing both sides of (28) by ^2 + iog(d - 1 ) j ^ taking e*- ! on 
both sides, and simplifying; 

n > - (29) 

which completes the proof of Lemma 2. □ 

Appendix B 
Proof of Lemma 3 

Proof of Lemma 3: First, note the $\Theta(\cdot)$ expression in
Lemma 3 contains two variables: $\frac{1}{P_e}$ and $P_T$. We analyze
the minimum number of independent iterations as a function
$N_{\mathrm{iter}} : [2, \infty) \times [A_A, \infty) \to \mathbb{R}^{\geq 0}$ (Definition 3), where
$A_A > 0$ is the transmit power for which $p_0$ is exactly the
threshold for decoding over the BSC [29, Section 4.3], [25].
Explicitly, if $\sigma_A$ is the threshold for Gallager-A decoding over
the BSC, $Q\left(\sqrt{2\eta A_A}\right) = \sigma_A$.

Note when $P_T < A_A$, it is not possible to force $P_e \to 0$,
hence $N_{\mathrm{iter}}$ will be infinite for all $P_e$ below some constant [25].
No further analysis is needed for such low transmit
powers, since all $P_e$ above said constant can be achieved
with $\Theta(1)$ transmit power and $\Theta(1)$ decoding iterations.
From [25, Eqn. (6)], the bit-error probability after the $i$th
decoding iteration, $p_i$, is

$$p_i = p_0 - p_0\left[\frac{1 + (1 - 2p_{i-1})^{d_c-1}}{2}\right]^{d_v-1} + (1 - p_0)\left[\frac{1 - (1 - 2p_{i-1})^{d_c-1}}{2}\right]^{d_v-1}. \tag{30}$$
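The recursion in (30) is easy to iterate numerically. The sketch below is a minimal Python illustration, assuming the standard Gallager-A density-evolution recursion over the BSC as in [25, Eqn. (6)]; the function names and the iteration cap are hypothetical choices, not part of the paper.

```python
def gallager_a_step(p_prev, p0, dv, dc):
    """One application of the recursion (30): error probability after the next iteration."""
    t = (1.0 - 2.0 * p_prev) ** (dc - 1)
    return (p0
            - p0 * ((1.0 + t) / 2.0) ** (dv - 1)
            + (1.0 - p0) * ((1.0 - t) / 2.0) ** (dv - 1))


def iterations_to_reach(p0, target, dv, dc, max_iters=10000):
    """Number of iterations of (30) needed to reach `target`, or None if not reached."""
    p = p0
    for i in range(1, max_iters + 1):
        p = gallager_a_step(p, p0, dv, dc)
        if p <= target:
            return i
    return None
```

For small $p_{i-1}$ and $d_v \geq 3$, one step is close to multiplying by $p_0(d_v-1)(d_c-1)$, which is the first-order term that the subsequent Taylor expansion isolates. For example, with $p_0 = 0.01$ on a (3, 6)-regular ensemble, `gallager_a_step(1e-6, 0.01, 3, 6) / 1e-6` is approximately 0.1.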


Since the RHS of (30) is differentiable with respect to (w.r.t.)
$p_{i-1}$, by Taylor's Theorem there exists a real function $R_1(x)$
with $\lim_{x \to 0} R_1(x) = 0$ such that:

$$p_i = p_0(d_v - 1)(d_c - 1)p_{i-1} + R_1(p_{i-1}). \tag{31}$$

The RHS of (31) is the first-order Maclaurin expansion of
$p_i$. Further, because the RHS of (30) is a polynomial in $p_{i-1}$,
it is twice continuously differentiable and by the mean value
theorem the remainder term $R_1(p_{i-1})$ has Lagrange form:

$$R_1(p_{i-1}) = \frac{1}{2}\,\frac{d^2 p_i(x^*)}{d p_{i-1}^2}\, p_{i-1}^2, \tag{32}$$

where x* € (0,pi_i). It can be verified that the second 
derivative of pi w.r.t. pi-i is minimized at pi-i = 0 and 
maximized at Pi-i = Solving for both cases and plugging 
into (32), we find 


-Poidy - l)(dc - 1) 


(d„-2)(d,-l) 


+ (dg — 2) 


Pi-1 


<Ri<0 


(33) 


Plugging (33) into (31) and applying the RHS recursively, the 
bit-error probability after ith decoding iteration pi is: 


Poidy - l)idc - l)Pr-l 


1 -p*-l 


idy-2)idg-l) 


+ {dg — 2 ) 


<Pi< [po{dy - l)idg - 1)]*. 


(34)

Now, choose an arbitrary $0 < \delta < \frac{1}{2}$ and choose $P_T$ (thereby
$p_0$ as well). As explained in [29, Section 4.3], since we
are operating above the threshold, we are guaranteed that
$p_i < p_{i-1} < p_0$ and $p_i \to 0$. Thus, for sufficiently small $p_i$
(thereby small $p_{i-1}$) or sufficiently large $P_T$ (thereby small
$p_0$), (34) becomes:

Po{dy - l)(dc - 1)(1 - S)p^-l <pi< [poidv - l)idc - 1)]* 

(35) 

Applying the relation on the LHS of (35) recursively 


Ly(d.-1)(4-1) 


Po 

(2 pPt + 1) 


5<i 


< Pi < [poidv - 1)(4 - 1)] 


( 1 ) 

< 


{dy - l){dc - 1 )- 


2 (2?7Pt + 1 ) 

=-r)PT 


( 1 ) 

< Pi 


N. 


iter 


1 + 


2r]PT 


< 


<Nu 


logj^ 
rjPr 
log (2pPT + 1) 


■qPr 


1 - 


pPt ' r]PT 

log ^Pt _ log0.5((i^ - l)(de - 1) 
2r]PT 


+ 1 + 


■qPr 

log (2pPT + 1) log JPt 


t?Pt 


2 pPt 


No. 


1 - 


log(d„ - l)(dc - 1) 


pPt 


< 


Pt > - > - ;dc ><ii; >2 

J- - ^ ^ J C-- U - 


< 


Niter [1 + log 3] + 1 ■ 


Pt>I>0 log^ 

vPt 

logs 


2-loiter 


p ^ 2 log(du — l)(dc — 1) 
P— T] 

< 


_e 

ijPt 


< [1 + log 3] Niter + 3, 


Pi =P0 


Po 


d„-l 

2^dv—l ^ ^ 

dv-1 

— 


dq, 1 


[1 + (1 - 2p,_i)''=-i] 


X [1 - (1 - 2p,_tf^-^] 

d„-l 


^Po 

2d 


- PO 
:r;-l 


dc — 1 
dd) 1 




m 




[1 + (1 - 2p,_tf^-^] 


dv — l — m 


d^-1 


(39) 
th order 


(36) 


The RHS of (39) is a polynomial in pi-i and the ^ 
Maclaurin expansion is 

Pi = Po (dc - 1)^ PilJT + Ppfe-i)- (40) 

Because the RHS of (39) is a polynomial, by the mean value 
theorem, the remainder has a Lagrange form (where x* G 
(0,Pi-i)): 


RsiPi-i) = 


Inverting all sides of (36), taking $\log(\cdot)$ on all sides, replacing
$p_i$ by $P_e$ and $i$ by $N_{\mathrm{iter}}$, and dividing all sides by $\eta P_T$:

logryPr \og{dy - l){dc - 1) log20r' 




The 


d„+l 


(37) 


We have shown that (37) holds for any choice of Pt as long 
as Pe is sufficiently small, which completes the proof of the 
constant Pt result. Next, set Pt > max{ , ^}. 

As explained above, (37) also holds for any Pg as long as Pt 
is sufficiently large. In this case (37) simplifies to 


th derivative of pi is another polynomial; therefore 
it must be bounded on the bounded interval [0, 4] and 

- cfp “Y < 77 b (pi-1) < cf P “TT: (42) 

for some constants cf,Cy > 0. Then choose Pt- Since 
we exceed the decoding threshold [25], pi < Pi-i < po- 
Now, take i to be the final iteration. Since we assumed 
limp , ^^1 = 0, we will also have limp = 0 

log i pp 

(the number of decoding iterations used in the coding strategy 
eventually exceeds 1 as Pg —> 0). Hence, for sufficiently small 
Pi (thereby small Pi_i), 


- ^Po {dc - 1) PiJi < RsiPi-i) 

< xPo {dc - 1)^ PilT- (43) 


Plugging (43) into (40), we have 
'fdv-Af, 

Po d„-i (dc - 1) = X 


< Po 




2 

iiv-i 3 
2 _ 

2 


Pr-\ < Pi 


Pi-\ ■ 


(44) 


(38) 

□ 


Applying (44) recursively, we obtain 


which completes the proof of the Lemma. 

Appendix C 
Proof of Lemma 4 

Proof of Lemma 4: We analyze the number of independent
iterations as a function $N_{\mathrm{iter}} : [2, \infty) \to \mathbb{R}^{\geq 0}$, since even in
the second case, the transmit power is a function of $P_e$. Let
$A_B > 0$ be the transmit power for which $p_0$ is exactly the
threshold for decoding over the BSC [29, Section 4.3], [25].
As explained in the proof of Lemma 3, we need not consider
cases where $P_T < A_B$. Using [29, Eqn. 4.15], for $d_v$ odd, the
bit-error probability after the $i$th decoding iteration follows


H-" 

Po 




d,,-l 




< Pi 


< 




(4_i)(4^)+-+(4^r 


(45) 


Loosening the LHS and RHS of (45) and grouping like-terms
we have


i+-+(^)’ 

Po 


d -1\ l^ 

I-i 1(4-1), 


(z>0;d<,>2) (po<l;dc>2) / /- l\ 3 

-Pi - [PO J 2 




X 




(^) 


Afit 


- 1 




- 1 




< log 


1 


< 


^ log- 

-1 Po 


k 2 ; 

_i xMt. 


+ 


- 1 


■log 




Applying (1) on $p_0$ terms in (47), dividing all sides by
$\eta P_T$, loosening the RHS by ignoring negative terms, and
simplifying:

^ r logdTTpPT - 2 l 0 g |(^^)(fic - 1 ) 

'' 2 > 2 + -—. 


(4 - 3) 


< < log(4 - 1) 


pPt 


pPr 

(4 - 3) 


pPr 


2 + 


•log ( 44 ^ 474 +.y^) 

pPt 



- 1 


2 -e 


< 


(4 - 3) pPr 


'4-l\^“”'^^ 2 + e 


2 y (4 - 3) ■ 

Taking log(-) on all sides of (49) and simplifying, we obtain: 



< log 


log A 

P-Pr 


(4 - 3) ’ 


which is equivalent to the desired result. 

Appendix D 
Proof of Theorem 3 
Proof of Theorem 3: Since we are proving a lower
bound, we can restrict ourselves to the case where $d_v \geq 3$
without loss of generality (the decoding power when $d_v = 2$
grows exponentially faster). Via Lemma 2 and Lemma 5,
the total power required for a $(d_v, d_c)$-regular LDPC code
and any iterative message-passing decoder to achieve bit-error
probability $P_e$ under the Wire Model is


/ 


(46) 


Simplifying geometric progressions, inverting all sides of (46),
taking $\log(\cdot)$ on all sides, and replacing $p_i$ by $P_e$ and $i$ by
$N_{\mathrm{iter}}$:


Pt 


total 


= n 


. log(de —1) 

log(ci7 


ZllI \ 


Pt 


logN 


V 


t]Pt{1 + 9tt) ^ 


d^-2 


J 


(51) 

First, it follows from (22) that if $P_T$ is kept fixed while
$P_e \to 0$, the total power (and the decoding power) diverges as


Ptotal,bdd Pt — ^ | 1*^§ 


log(dc-l) 

lQg(ti^ —1) 


(52) 


(47) 


The exponent of $\log \frac{1}{P_e}$ in (52) is always greater than 1 since
$d_c > d_v$ for any regular-LDPC code. Next, differentiating the
expression inside the $\Omega(\cdot)$ of (51) w.r.t. $P_T$, setting to zero,
and substituting the minimizing transmit power into (51), we
find that the minimum total power is:


log(dc-l) 1 

log(d„ —1) - 

Pe 


Ptotal,min — ^ 

which completes the proof of the theorem. 


(53) 


□ 


■ (48) 


We have shown that (48) holds for any $P_T$ as long as $P_e$ is
sufficiently small. Thus, treating $P_T$ as a constant in (48) and
taking $\log(\cdot)$ on all sides completes the proof of the fixed $P_T$
result. For the other case, consider the limits of the leftmost
and rightmost sides of (48) as $P_T \to \infty$. For any $\epsilon > 0$, for
sufficiently large $P_T$ the following holds:

Niter \ n . log 4 


Appendix E 
Proof of Theorem 4 

(P y 

Proof of Theorem 4: Let A^ter denote the minimum num¬ 
ber of independent Gallager-A decoding iterations required 
to achieve bit-error probability Pe- Via Theorem 2, the total 
power is lower bounded by Ptotai = Pt + ^ It 

follows from Lemma 3 that if $P_T$ is kept fixed as $P_e \to 0$, then
the required decoding power diverges at least as fast as a power
of $\frac{1}{P_e}$, which is exponentially larger than the power required
for uncoded transmission. If instead the transmit power is
allowed to vary, it follows from Lemma 3 that


■Ptotai = n Pt + 


(49) 


Pt 


total 


= O Pr + 



(54) 

/ 

2~i \ 


riPp \ 

(55) 


In order to find the optimizing transmit power, let $L_{P_e}(P_T)$
denote the function in the $\Omega(\cdot)$ expression of (54) and let
$U_{P_e}(P_T)$ denote the function in the $O(\cdot)$ expression of (55):


(50) 



{ 1 ^ 

\ 


□ 

Ppe (Pt) = 

= Pt + 1 


) 

(56) 




( 1 ^ 



Up^{Pt) - 

= Pt + 1 

KPe, 

) • 

(57) 


We start by analyzing the lower bound. To find the $P_T$ which
minimizes $L_{P_e}$, we differentiate $L_{P_e}$ and set it to 0


If instead Pp is allowed to scale as a function of P^, 


dLf 


dPr 


= 1 — e 


A 7 log ^ 
vPp 


= 0 


Aotal 


Theorem 2 


n (Pt + 


p2 

1 log p- 


log 


— g V^T 


(58) 


Now, let P = 


p^ _ 


Pt 


7 log 


= . Substituting into (58), we get 


^ | ^ log( ) 


( 66 ) 

(67) 

( 68 ) 


logPe 


log'P 


2V log P = -^ 
1 

'' 2V rj 


/ 7 log A 


(59) 

(60) 


Using the upper bound from Theorem 2, 

Aogf \ 


Pt 


total 


= O Pt 


2-1 


Pt 


(69) 


The positive, real valued solution to (60) is given by the
principal branch $W_0(\cdot)$ of the Lambert $W$ function [71].
Explicitly, when $x, z \in \mathbb{R}^{\geq 0}$ satisfy the relation $x = z e^{z}$,
we say $z = W_0(x)$. Hence we can write



logP = Wo \- 


^P = 


7 log A 


logPe 


log'P 


(60)(61) 


7 log p- 


logP 


l^n 


7 log p- 


(61) 


(62) 


Then, considering the bounds on 7 in Theorem 2, we examine 
the exponents of log in (65) and —pr^ in (68): 

dc>d„>3 

7 > log {dy - 1) + log {dc - 1) > log 12 

1 I log (dc-1) 

-1 7^ losfd —It f^c>f^u>3 

> — > 2. (70) 


7 


log(iL^) 1- 


log 2 


log 

It follows that if the transmit power is kept fixed even 
as Pe —)■ 0, the total power diverges as Ptotai.bdd Pt = 
U ^log^'"'® Moving to the unbounded case, substitut¬ 
ing (70) back into (68), we obtain: 


Rewriting V in terms of Pp we find the optimizing transmit 
power 

7 log p- 


Pt 


total 


= U Pt + 


'1^' 

Pt 


(71) 


Differentiating the expression inside the U( ) on the RHS 


P^ = 


21Un 


/7log+ \ 

like: 

/ 2 1 \ 

V j 


^total,min — 1 ^~) 


The first two terms in the asymptotic expansion of $W_0(x)$ as
$x \to \infty$ are $\log(x) - \log\log(x)$ [71]. In fact, $\forall x \geq e$ [72]:

$$\log(x) - \log\log(x) \leq W_0(x) \leq \log(x) - \frac{1}{2}\log\log(x). \tag{64}$$
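The bounds in (64) can be checked numerically. The sketch below is a minimal Python illustration that computes $W_0(x)$ by Newton's method on $z e^{z} = x$, starting from $\log(1+x)$; the solver, its tolerance, and the function names are assumptions made for this example and are not the method used in [71] or [72].

```python
import math


def lambert_w0(x, tol=1e-13, max_iter=200):
    """Principal branch W0(x) for x >= 0, via Newton's method on z*exp(z) = x."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    z = math.log1p(x)  # starting point at or above the root for x >= 0
    for _ in range(max_iter):
        ez = math.exp(z)
        step = (z * ez - x) / (ez * (z + 1.0))
        z -= step
        if abs(step) <= tol * (1.0 + abs(z)):
            break
    return z


def bounds_64(x):
    """Lower and upper bounds of (64), stated for x >= e."""
    log_x = math.log(x)
    log_log_x = math.log(log_x)
    return log_x - log_log_x, log_x - 0.5 * log_log_x
```

Evaluating `lambert_w0(x)` against `bounds_64(x)` for a few values such as 3, 10, and 1e6 shows the solution sitting between the two bounds, with the gap between the bounds growing only like $\frac{1}{2}\log\log(x)$.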

Using (64) in (63), and ignoring constant terms in the resulting 
denominator, the optimizing transmit power is 

/ 7 log, ^ 

Pt = 0 


This lower bound tightens when dydc > 4((i„ + dy). Using 
Theorem 2 for this case. 


Pt 


total 


n Pt 


losA 

Pt 


V P. 


-ftotal ,r] 


log 32 ^31 

log 32+1 40 


n hog4o _ 


X log log - 2 log log log 2 WJ 

Plugging back into $L_{P_e}$ in (56), and ignoring non-dominating
terms, we get the lower bound. An identical analysis of $U_{P_e}$
in (57) gives the upper bound and completes the proof. □


Appendix F 
Proof of Theorem 5 

Proof of Theorem 5: Via Theorem 2 and Lemma 4, if the
transmit power is kept fixed as $P_e \to 0$, the total power is

1 


Moving to the u^per bound, via Theorem 2, we find that the 
exponent of in (69) is 


27 < 


6 \og{2dydc + 1) 


log(A - 1) - log 2’ 

Then substituting (72) into (69), we get the bound 

61og(2d„de + l) 

P 


total 


= 0\Pt 


]Qg j_ \ logCd-i;-l)-log2 


Pt 


(72) 


(73) 


Aotal.bdd Pt = ^ 7+ -f log 


Pe 


Differentiating the expression inside the $O(\cdot)$ of (73) w.r.t.
$P_T$ and setting to zero, we obtain the upper bound. □


(65) 























































Appendix G 
CAD FLOW DETAILS 

The decoding implementation models are constructed in a
hierarchical manner. First, behavioral Verilog descriptions of
variable and check nodes are mapped to standard cells using
logic synthesis and are then placed-and-routed using a physical
design tool. The physical area of the individual circuits
is obtained. Post-layout simulation is then performed, using
extracted RC parasitics and typical corners for the Synopsys
32/28nm HVT CMOS process at a supply voltage of 0.78V.
The critical-path delays of the variable-nodes $T_{VN}(a, d_v)$
and check-nodes $T_{CN}(a, d_c)$ for each decoding algorithm are
obtained using post-layout static timing analysis with the
parasitics included. Post-layout power analysis is performed
to obtain the average power consumption of variable-nodes
$P_{VN}(a, d_v)$ and check-nodes $P_{CN}(a, d_c)$ using a “virtual
clock” of period $T_{VN}(a, d_v)$ or $T_{CN}(a, d_c)$, respectively, over
a large number of uniformly random input patterns. In practice,
however, the amount of switching activity at the decoder
depends on the number of errors in the received sequence
over the channel, and it thereby depends on the parameters of
the channel and communication system. For example, when
the transmit power is large and/or the path-loss and noise are
small, the expected number of errors in the received sequence
is small and the switching activity caused by bit-flips may be
much smaller than these simulations indicate. Nevertheless, we
assume (with slight overestimation) that the averaged power
numbers hold for the various check-nodes and variable-nodes.

Appendix H 

Circuit model for critical-path delay 

It is assumed that all decoders operate at the minimum clock
period $T_{CLK}(a, g, d_v, d_c)$ for which timing would be met at the
0.78V supply voltage. This minimum allowable clock period
that meets timing in flip-flop based synchronous circuits is
bounded by the setup time constraint [19] for each flip-flop.
The setup time is the minimum time it takes the incoming data
to a flip-flop to propagate through the input stages of the flip-flop.
The critical path for a full decoding iteration consists
of a CLK-Q delay of a message passing flip-flop inside a
variable-node, then an interconnect delay, then a check-node
delay $T_{CN}(a, d_c)$, then another interconnect delay, and finally
a variable-node delay $T_{VN}(a, d_v)$. In these models, the setup
and the CLK-Q delay are accounted for in $T_{VN}(a, d_v)$.

Footnote: The delay, power, area, and structure of synthesized logic depend on the
constraints and mapping effort given as inputs to the synthesis tool. To allow
for a fair comparison between codes of different degrees, we only specify
constraints for minimum delay and minimum power and use the highest
possible mapping effort for each node.

Interconnect delay is assumed to be linearly proportional
to the resistance and capacitance of the interconnect, which
depend on the length and width of interconnects. Estimating
the length of interconnects requires an estimate of the decoder's
physical dimensions. The total area of the decoder,
$A_{\mathrm{Decoder}}$, is estimated as a sum of check-node and variable-node
areas, where the nodes are assumed to be placed in a
square arrangement. Best-case and worst-case estimates for
the average interconnect length $l_{\mathrm{wire}}(a, g, d_v, d_c)$ are obtained
by the following equations [33]


aO.25 

■^Decoder 


^wire(^; ^c) — \ '\f~^ 


Decoder 


in best case. (74) 
in worst case. (75) 


Rigorous empirical and theoretical justification for the above 
estimates is provided in [33] where it is shown that (74) 
is a good approximation for highly-parallel logic and (75) 
is the average value for randomly-placed logic on a square 
array. Since the logic functions computed by the check-nodes 
and variable-nodes for the decoding algorithms considered 
in this paper are intrinsically parallel and we also assume 
the decoders are implemented in a fully-parallel manner, we 
used (74) for the results shown in this paper. However, we 
note that (75) could be a better approximation, depending on 
the code construction used. 

Routing for decoders is assumed to use minimum-width
wires on the lower 7 metal layers of the 9-layer CMOS process.
The average minimum width ($w_{\mathrm{avg}}$), sheet resistance
($R_{\mathrm{sq}}$), and capacitance per-unit-length ($C_{\mathrm{unit}}$) for these
metal layers are calculated using design rule information [54]
and are assumed as constants. Interconnect delay is then
estimated assuming a distributed Elmore model [19]:

$$R_{\mathrm{wire}}(a, g, d_v, d_c) = R_{\mathrm{sq}} \times \frac{l_{\mathrm{wire}}(a, g, d_v, d_c)}{w_{\mathrm{avg}}} \tag{76}$$

$$C_{\mathrm{wire}}(a, g, d_v, d_c) = C_{\mathrm{unit}} \times l_{\mathrm{wire}}(a, g, d_v, d_c) \tag{77}$$

$$T_{\mathrm{wire}}(a, g, d_v, d_c) = \frac{R_{\mathrm{sq}}\, C_{\mathrm{unit}}\, l_{\mathrm{wire}}^2(a, g, d_v, d_c)}{2 w_{\mathrm{avg}}}. \tag{78}$$
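The interconnect model of (76)-(78) can be sketched in a few lines of Python. This is a minimal illustration, assuming the distributed Elmore delay is half the product of total wire resistance and total wire capacitance; the function names and the numeric values in the usage note are hypothetical and are not taken from the design rules in [54].

```python
def wire_resistance(r_sq, l_wire, w_avg):
    """(76): sheet resistance times the number of squares along the wire."""
    return r_sq * l_wire / w_avg


def wire_capacitance(c_unit, l_wire):
    """(77): capacitance per unit length times wire length."""
    return c_unit * l_wire


def wire_delay(r_sq, c_unit, l_wire, w_avg):
    """(78): distributed Elmore delay, R_wire * C_wire / 2."""
    return wire_resistance(r_sq, l_wire, w_avg) * wire_capacitance(c_unit, l_wire) / 2.0
```

With illustrative values of 0.1 ohm per square, 2e-10 F/m, a 100 µm wire, and a 50 nm width, `wire_delay(0.1, 2e-10, 100e-6, 50e-9)` gives about 2 ps. Because the delay is quadratic in `l_wire`, doubling the wire length quadruples the delay, which is why the choice between the best-case and worst-case length estimates matters.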


Appendix I 

Circuit model for computation power 

The power consumption of a logic gate consists of both
dynamic power (which is proportional to the activity-factor at
the input of the gate and the clock-frequency), and static power
(which has no dependence on the activity-factor or the clock
frequency) [19]. In post-layout simulation, the static power
consumption of variable-nodes and check-nodes at 0.78V
supply in a high threshold-voltage process is observed to be
less than 1% of the total power in check-nodes and variable-nodes.
Therefore, with little loss in accuracy, we treat the
total power consumption of check-nodes and variable-nodes as
dynamic power when considering the effect of clock-frequency
scaling. Consequently, the power consumed after clock-frequency
scaling in variable-nodes is $P_{VN}(a, d_v) \times \frac{T_{VN}(a, d_v)}{T_{CLK}(a, g, d_v, d_c)}$, and
in check-nodes it is $P_{CN}(a, d_c) \times \frac{T_{CN}(a, d_c)}{T_{CLK}(a, g, d_v, d_c)}$.

Appendix J 

Circuit model for interconnect power 

Using the interconnect capacitance estimate
$C_{\mathrm{wire}}(a, g, d_v, d_c)$ and clock period $T_{CLK}(a, g, d_v, d_c)$
from Appendix H, and assuming an activity factor of
$\frac{1}{2}$, the power consumed by a single message-passing
interconnect ($P_{\mathrm{wire}}(a, g, d_v, d_c)$) in the decoder is modeled
using the formula for the dynamic power consumed in
interconnects [19]:

$$P_{\mathrm{wire}}(a, g, d_v, d_c) = \frac{C_{\mathrm{wire}}(a, g, d_v, d_c) \times (0.78\,\mathrm{V})^2}{2\,T_{CLK}(a, g, d_v, d_c)}. \tag{79}$$

Footnote: The top two metal layers are often used to construct a global power grid
for an entire chip.

Footnote: The capacitance estimate includes parallel-plate and fringing components [19].
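The interconnect power model in (79) can be sketched as follows. This is a minimal Python illustration, assuming the formula reads as wire capacitance times the squared supply voltage divided by twice the clock period; the function name and the example values are hypothetical, and the 0.78 V default mirrors the supply voltage quoted in the appendices.

```python
def wire_power(c_wire, t_clk, vdd=0.78):
    """Dynamic power of one message-passing interconnect, following the form of (79):
    C_wire * Vdd^2 / (2 * T_CLK)."""
    if t_clk <= 0:
        raise ValueError("clock period must be positive")
    return c_wire * vdd ** 2 / (2.0 * t_clk)
```

As an example, a 20 fF wire switched with a 1 ns clock period gives `wire_power(2e-14, 1e-9)`, roughly 6.1 µW. Multiplying this per-wire figure by the number of edges in the Tanner graph would give a rough total interconnect power under the same assumptions.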
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